Hi everyone,
While working with a tiered memory system, e.g. CXL memory, I have
been facing migration overhead, especially TLB shootdown on promotion
or demotion between different tiers. Most TLB shootdowns on migration
through the hinting fault path can already be avoided thanks to Huang
Ying's work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch
if PTE is inaccessible"). See the following link:
https://lore.kernel.org/lkml/[email protected]/
However, that only covers migrations triggered by hinting faults. It
would be much better to have a general mechanism that reduces the
number of TLB flushes and TLB misses and that can ultimately be applied
to any type of migration, although for now I apply it only to tiering.
I'm suggesting a mechanism called MIGRC, which stands for 'Migration
Read Copy'. It reduces TLB flushes by keeping both the source and
destination folios of a migration alive until all the required TLB
flushes have been done, but only if those folios are not mapped by any
PTE entries with write permission.
To achieve that:
1. For folios mapped only by non-writable TLB entries, skip the TLB
   flush at migration by keeping both the source and destination
   folios, which will be handled later at a better time.
2. When any non-writable TLB entry changes to writable, e.g. through
   the fault handler, give up the migrc mechanism and perform the
   required TLB flush right away.
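To illustrate the idea, here is a rough pseudo-C sketch of the per-folio
decision at migration time (the helper names below are illustrative only,
not the exact symbols introduced by this series):

    /* Illustrative sketch only, not the actual implementation. */
    if (folio_mapped_read_only(src)) {          /* hypothetical check */
            /*
             * Defer the shootdown: keep both the source and the
             * destination folio alive and accumulate the CPUs that
             * would need flushing into a pending batch.
             */
            keep_src_and_dst_folios(src, dst);  /* hypothetical */
            fold_into_pending_tlb_batch(src);   /* hypothetical */
    } else {
            flush_tlb_now();                    /* the usual path */
    }

    /*
     * Later, e.g. when one of the deferred PTEs is about to become
     * writable or memory gets tight, flush the pending batch once and
     * free the kept source folios.
     */
    flush_pending_and_free_kept_folios();       /* hypothetical */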
I observed a big improvement in the number of TLB flushes and TLB
misses in the following evaluation using XSBench:
1. itlb flushes were reduced by 93.9%.
2. dtlb thread flushes were reduced by 43.5%.
3. stlb flushes were reduced by 24.9%.
4. dtlb store misses were reduced by 34.2%.
5. itlb load misses were reduced by 45.5%.
6. The runtime was reduced by 3.5%.
I believe it would help even more with real-world workloads.
---
The measurement result:
Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.7, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 100000000 (-p option makes the runtime longer)
run 'perf stat' using events:
1) itlb.itlb_flush
2) tlb_flush.dtlb_thread
3) tlb_flush.stlb_any
4) dTLB-load-misses
5) dTLB-store-misses
6) iTLB-load-misses
run 'cat /proc/vmstat' and pick:
1) numa_pages_migrated
2) pgmigrate_success
3) nr_tlb_remote_flush
4) nr_tlb_remote_flush_received
5) nr_tlb_local_flush_all
6) nr_tlb_local_flush_one
BEFORE - mainline v6.7
----------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 100000000
Performance counter stats for 'system wide':
85647229 itlb.itlb_flush
480981504 tlb_flush.dtlb_thread
323937200 tlb_flush.stlb_any
238381632579 dTLB-load-misses
601514255 dTLB-store-misses
2974157461 iTLB-load-misses
2252.883892112 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 12790664
pgmigrate_success 26835314
nr_tlb_remote_flush 3031412
nr_tlb_remote_flush_received 45234862
nr_tlb_local_flush_all 216584
nr_tlb_local_flush_one 740940
...
AFTER - mainline v6.7 + migrc
-----------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 100000000
Performance counter stats for 'system wide':
5240261 itlb.itlb_flush
271581774 tlb_flush.dtlb_thread
243149389 tlb_flush.stlb_any
234502983364 dTLB-load-misses
395673680 dTLB-store-misses
1620215163 iTLB-load-misses
2172.283436287 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 14897064
pgmigrate_success 30825530
nr_tlb_remote_flush 198290
nr_tlb_remote_flush_received 2820156
nr_tlb_local_flush_all 92048
nr_tlb_local_flush_one 741401
...
---
Changes from v7:
1. Rewrite the cover letter to explain what the 'migrc' mechanism is.
   (feedback from Andrew Morton)
2. Supplement the commit message of the patch 'mm: Add APIs to
   free a folio directly to the buddy bypassing pcp'.
   (feedback from Andrew Morton)
Changes from v6:
1. Fix build errors when CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
   is disabled by moving the migrc_flush_{start,end}() calls from
   arch code to try_to_unmap_flush() in mm/rmap.c.
Changes from v5:
1. Fix build errors when CONFIG_MIGRATION is disabled or
   CONFIG_HWPOISON_INJECT is built as a module. (feedback from the
   kernel test robot and Raymond Jay Golo)
2. Organize the migrc code with two kconfigs, CONFIG_MIGRATION and
   CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
Changes from v4:
1. Rebase on v6.7.
2. Fix build errors on arm64, which does nothing for batched TLB flush
   but has CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. (reported by the
   kernel test robot)
3. Don't use any page flag. The system gives up the migrc mechanism
   more often as a result, but that's okay; the final improvement is
   still good enough.
4. Instead, optimize the full TLB flush (arch_tlbbatch_flush()) by
   excluding redundant CPUs from the TLB flush.
Changes from v3:
1. Don't use the kconfig CONFIG_MIGRC, and remove the sysctl knob
   migrc_enable. (feedback from Nadav)
2. Remove the optimization that skipped CPUs which had already
   performed the needed TLB flushes for any reason when migrc performs
   its TLB flushes, because I couldn't tell the performance difference
   with and without the optimization. (feedback from Nadav)
3. Minimize arch-specific code. While at it, move all the migrc
   declarations and inline functions from include/linux/mm.h to
   mm/internal.h. (feedback from Dave Hansen, Nadav)
4. Split out the part that pauses migrc when the system is under high
   memory pressure into a separate patch. (feedback from Nadav)
5. Rename:
   a. arch_tlbbatch_clean() to arch_tlbbatch_clear(),
   b. tlb_ubc_nowr to tlb_ubc_ro,
   c. migrc_try_flush_free_folios() to migrc_flush_free_folios(),
   d. migrc_stop to migrc_pause.
   (feedback from Nadav)
6. Use the ->lru list_head instead of introducing a new llist_head.
   (feedback from Nadav)
7. Use non-atomic page-flag operations where it's safe.
   (feedback from Nadav)
8. Use the stack instead of keeping a pointer to 'struct migrc_req'
   in struct task_struct, since it is only manipulated locally.
   (feedback from Nadav)
9. Replace a lot of simple functions with inline functions placed
   in a header, mm/internal.h. (feedback from Nadav)
10. Add additional sufficient comments. (feedback from Nadav)
11. Remove a lot of wrapper functions. (feedback from Nadav)
Changes from RFC v2:
1. Remove the additional field in struct page. To do that, union
   migrc's list with the lru field and add a page flag. I know a page
   flag is something we don't like to add, but there is no choice
   because migrc has to distinguish folios under its control from
   others. To keep the impact small, migrc is only used on 64-bit
   systems.
2. Remove the internal object allocator I introduced to minimize the
   impact on the system; a ton of tests showed it made no difference.
3. Stop migrc from working when the system is under high memory
   pressure, e.g. about to perform direct reclaim. When the swap
   mechanism is heavily used, I found the system suffered a regression
   without this control.
4. Exclude folios with pte_dirty() == true from migrc's interest so
   that migrc can work more simply.
5. Combine several tightly coupled patches into one.
6. Add sufficient comments for better review.
7. Manage migrc's requests per node (previously global).
8. Add the TLB miss improvement to the commit message.
9. Test with more CPUs (4 -> 16) to see a bigger improvement.
Changes from RFC:
1. Fix a bug triggered when a destination folio of a previous
   migration becomes the source folio of the next migration before it
   has been handled properly, which left the folio's state
   inconsistent.
2. Split the patch set into more pieces so that folks can review it
   more easily. (feedback from Nadav Amit)
3. Fix wrong usage of a barrier, e.g. smp_mb__after_atomic().
   (feedback from Nadav Amit)
4. Add sufficient comments to explain the patch set better.
   (feedback from Nadav Amit)
Byungchul Park (8):
x86/tlb: Add APIs manipulating tlb batch's arch data
arm64: tlbflush: Add APIs manipulating tlb batch's arch data
mm/rmap: Recognize read-only TLB entries during batched TLB flush
x86/tlb, mm/rmap: Separate arch_tlbbatch_clear() out of
arch_tlbbatch_flush()
mm: Separate move/undo doing on folio list from migrate_pages_batch()
mm: Add APIs to free a folio directly to the buddy bypassing pcp
mm: Defer TLB flush by keeping both src and dst folios at migration
mm: Pause migrc mechanism at high memory pressure
arch/arm64/include/asm/tlbflush.h | 19 ++
arch/x86/include/asm/tlbflush.h | 18 ++
arch/x86/mm/tlb.c | 2 -
include/linux/mm.h | 23 ++
include/linux/mmzone.h | 7 +
include/linux/sched.h | 9 +
mm/internal.h | 78 ++++++
mm/memory.c | 8 +
mm/migrate.c | 411 ++++++++++++++++++++++++++----
mm/page_alloc.c | 34 ++-
mm/rmap.c | 40 ++-
mm/swap.c | 7 +
12 files changed, 597 insertions(+), 59 deletions(-)
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
2.17.1
This is a preparation for the migrc mechanism, which needs to avoid
redundant TLB flushes by manipulating the tlb batch's arch data after
arch_tlbbatch_flush(). Currently that is not possible because the data
gets cleared inside arch_tlbbatch_flush(). So separate the part that
clears the tlb batch's arch data out of arch_tlbbatch_flush().
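With the clearing separated out, a caller that wants to reuse or inspect
the batch's arch data after the flush can do so, roughly like this (a
minimal sketch of the intended calling sequence; it mirrors the
try_to_unmap_flush() change below):

    arch_tlbbatch_flush(&tlb_ubc->arch);  /* flush; cpumask is kept now */
    /* ... manipulate the arch data, e.g. fold it into another batch ... */
    arch_tlbbatch_clear(&tlb_ubc->arch);  /* clear it explicitly when done */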
Signed-off-by: Byungchul Park <[email protected]>
---
arch/x86/mm/tlb.c | 2 --
mm/rmap.c | 1 +
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 453ea95b667d..941f41df02f3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1268,8 +1268,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
- cpumask_clear(&batch->cpumask);
-
put_flush_tlb_info();
put_cpu();
}
diff --git a/mm/rmap.c b/mm/rmap.c
index da36f23ff7b0..b484d659d0c1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -643,6 +643,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
+ arch_tlbbatch_clear(&tlb_ubc->arch);
tlb_ubc->flush_required = false;
tlb_ubc->writable = false;
}
--
2.17.1
Functionally, no change. This is a preparation for the migrc mechanism,
which needs separate folio lists for its own handling at migration.
Refactor migrate_pages_batch() by separating out the move and undo parts
that operate on a folio list.
Signed-off-by: Byungchul Park <[email protected]>
---
mm/migrate.c | 134 +++++++++++++++++++++++++++++++--------------------
1 file changed, 83 insertions(+), 51 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 397f2a6e34cb..bbe1ecef4956 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1611,6 +1611,81 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
return nr_failed;
}
+static void migrate_folios_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ bool is_thp;
+ int nr_pages;
+ int rc;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_folio, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch(rc) {
+ case -EAGAIN:
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
+static void migrate_folios_undo(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ struct list_head *ret_folios)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
/*
* migrate_pages_batch() first unmaps folios in the from list as many as
* possible, then move the unmapped folios.
@@ -1633,7 +1708,7 @@ static int migrate_pages_batch(struct list_head *from,
int pass = 0;
bool is_thp = false;
bool is_large = false;
- struct folio *folio, *folio2, *dst = NULL, *dst2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
@@ -1769,42 +1844,11 @@ static int migrate_pages_batch(struct list_head *from,
thp_retry = 0;
nr_retry_pages = 0;
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
- nr_pages = folio_nr_pages(folio);
-
- cond_resched();
-
- rc = migrate_folio_move(put_new_folio, private,
- folio, dst, mode,
- reason, ret_folios);
- /*
- * The rules are:
- * Success: folio will be freed
- * -EAGAIN: stay on the unmap_folios list
- * Other errno: put on ret_folios list
- */
- switch(rc) {
- case -EAGAIN:
- retry++;
- thp_retry += is_thp;
- nr_retry_pages += nr_pages;
- break;
- case MIGRATEPAGE_SUCCESS:
- stats->nr_succeeded += nr_pages;
- stats->nr_thp_succeeded += is_thp;
- break;
- default:
- nr_failed++;
- stats->nr_thp_failed += is_thp;
- stats->nr_failed_pages += nr_pages;
- break;
- }
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ /* Move the unmapped folios */
+ migrate_folios_move(&unmap_folios, &dst_folios,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1813,20 +1857,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
/* Cleanup remaining folios */
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
- anon_vma, true, ret_folios);
- list_del(&dst->lru);
- migrate_folio_undo_dst(dst, true, put_new_folio, private);
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ migrate_folios_undo(&unmap_folios, &dst_folios,
+ put_new_folio, private, ret_folios);
return rc;
}
--
2.17.1
A regression was observed when the system is under high memory pressure
with swap on: migrc might keep a number of folios in its pending queue,
which can make things worse. So temporarily prevent migrc from working
in that condition.
Signed-off-by: Byungchul Park <[email protected]>
---
mm/internal.h | 20 ++++++++++++++++++++
mm/migrate.c | 18 +++++++++++++++++-
mm/page_alloc.c | 13 +++++++++++++
3 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/mm/internal.h b/mm/internal.h
index ab02cb8306e2..55781f879fb2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1285,6 +1285,8 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
#endif /* CONFIG_SHRINKER_DEBUG */
#if defined(CONFIG_MIGRATION) && defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+extern atomic_t migrc_pause_cnt;
+
/*
* Reset the indicator indicating there are no writable mappings at the
* beginning of every rmap traverse for unmap. Migrc can work only when
@@ -1313,6 +1315,21 @@ static inline bool can_migrc_test(void)
return current->can_migrc && current->tlb_ubc_ro.flush_required;
}
+static inline void migrc_pause(void)
+{
+ atomic_inc(&migrc_pause_cnt);
+}
+
+static inline void migrc_resume(void)
+{
+ atomic_dec(&migrc_pause_cnt);
+}
+
+static inline bool migrc_paused(void)
+{
+ return !!atomic_read(&migrc_pause_cnt);
+}
+
/*
* Return the number of folios pending TLB flush that have yet to get
* freed in the zone.
@@ -1332,6 +1349,9 @@ void migrc_flush_end(struct tlbflush_unmap_batch *batch);
static inline void can_migrc_init(void) {}
static inline void can_migrc_fail(void) {}
static inline bool can_migrc_test(void) { return false; }
+static inline void migrc_pause(void) {}
+static inline void migrc_resume(void) {}
+static inline bool migrc_paused(void) { return false; }
static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
static inline bool migrc_flush_free_folios(void) { return false; }
static inline void migrc_flush_start(void) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index cbe5372f159e..fbc8586ed735 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -62,6 +62,12 @@ static struct tlbflush_unmap_batch migrc_ubc;
static LIST_HEAD(migrc_folios);
static DEFINE_SPINLOCK(migrc_lock);
+/*
+ * Increase on entry of handling high memory pressure e.g. direct
+ * reclaim, decrease on the exit. See __alloc_pages_slowpath().
+ */
+atomic_t migrc_pause_cnt = ATOMIC_INIT(0);
+
static void init_tlb_ubc(struct tlbflush_unmap_batch *ubc)
{
arch_tlbbatch_clear(&ubc->arch);
@@ -1922,7 +1928,8 @@ static int migrate_pages_batch(struct list_head *from,
*/
init_tlb_ubc(&pending_ubc);
do_migrc = IS_ENABLED(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH) &&
- (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED);
+ (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED) &&
+ !migrc_paused();
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
@@ -1961,6 +1968,15 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ /*
+ * In case that the system is in high memory
+ * pressure, give up migrc mechanism this turn.
+ */
+ if (unlikely(do_migrc && migrc_paused())) {
+ fold_ubc(tlb_ubc, &pending_ubc);
+ do_migrc = false;
+ }
+
can_migrc_init();
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6ef0c22b1109..366777afce7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4072,6 +4072,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned int cpuset_mems_cookie;
unsigned int zonelist_iter_cookie;
int reserve_flags;
+ bool migrc_paused = false;
restart:
compaction_retries = 0;
@@ -4203,6 +4204,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;
+ /*
+ * The system is in very high memory pressure. Pause migrc from
+ * expanding its pending queue temporarily.
+ */
+ if (!migrc_paused) {
+ migrc_pause();
+ migrc_paused = true;
+ migrc_flush_free_folios();
+ }
+
/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;
@@ -4330,6 +4341,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
got_pg:
+ if (migrc_paused)
+ migrc_resume();
return page;
}
--
2.17.1
Implement the MIGRC mechanism, which stands for 'Migration Read Copy'.
While working with tiered memory, e.g. CXL memory, we always face
migration overhead at either promotion or demotion, and TLB shootdown
turns out to be a big part of it that is worth getting rid of if
possible.
Fortunately, the TLB flush can be deferred as long as both the source
and destination folios of a migration are kept until all the required
TLB flushes have been done, but only if the target PTE entries have
read-only permission, or more precisely, do not have write permission.
Otherwise, the folio might get corrupted.
To achieve that:
1. For folios mapped only by non-writable TLB entries, skip the TLB
   flush at migration by keeping both the source and destination
   folios, which will be handled later at a better time.
2. When any non-writable TLB entry changes to writable, e.g. through
   the fault handler, give up the migrc mechanism and perform the
   required TLB flush right away.
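For example, point 2 above means the write-fault path has to force the
deferred flush before any of the affected PTEs becomes writable;
conceptually (a sketch of what the do_wp_page() hunk below does):

    /* In the write-fault path, before granting write access. */
    if (folio)
            migrc_flush_free_folios();  /* perform the deferred TLB flush
                                         * and free the source folios
                                         * kept by migrc */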
The following evaluation using XSBench shows the improvement:
1. itlb flushes were reduced by 93.9%.
2. dtlb thread flushes were reduced by 43.5%.
3. stlb flushes were reduced by 24.9%.
4. dtlb store misses were reduced by 34.2%.
5. itlb load misses were reduced by 45.5%.
6. The runtime was reduced by 3.5%.
The measurement result:
Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.7, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 100000000 (-p option makes the runtime longer)
run 'perf stat' using events:
1) itlb.itlb_flush
2) tlb_flush.dtlb_thread
3) tlb_flush.stlb_any
4) dTLB-load-misses
5) dTLB-store-misses
6) iTLB-load-misses
run 'cat /proc/vmstat' and pick:
1) numa_pages_migrated
2) pgmigrate_success
3) nr_tlb_remote_flush
4) nr_tlb_remote_flush_received
5) nr_tlb_local_flush_all
6) nr_tlb_local_flush_one
BEFORE - mainline v6.7
----------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 100000000
Performance counter stats for 'system wide':
85647229 itlb.itlb_flush
480981504 tlb_flush.dtlb_thread
323937200 tlb_flush.stlb_any
238381632579 dTLB-load-misses
601514255 dTLB-store-misses
2974157461 iTLB-load-misses
2252.883892112 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 12790664
pgmigrate_success 26835314
nr_tlb_remote_flush 3031412
nr_tlb_remote_flush_received 45234862
nr_tlb_local_flush_all 216584
nr_tlb_local_flush_one 740940
...
AFTER - mainline v6.7 + migrc
-----------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 100000000
Performance counter stats for 'system wide':
5240261 itlb.itlb_flush
271581774 tlb_flush.dtlb_thread
243149389 tlb_flush.stlb_any
234502983364 dTLB-load-misses
395673680 dTLB-store-misses
1620215163 iTLB-load-misses
2172.283436287 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 14897064
pgmigrate_success 30825530
nr_tlb_remote_flush 198290
nr_tlb_remote_flush_received 2820156
nr_tlb_local_flush_all 92048
nr_tlb_local_flush_one 741401
...
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/mmzone.h | 7 ++
include/linux/sched.h | 8 ++
mm/internal.h | 53 ++++++++
mm/memory.c | 8 ++
mm/migrate.c | 271 +++++++++++++++++++++++++++++++++++++++--
mm/page_alloc.c | 11 +-
mm/rmap.c | 12 +-
7 files changed, 358 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9db36e197712..492111cd1176 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1002,6 +1002,13 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+
+#if defined(CONFIG_MIGRATION) && defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ /*
+ * the number of folios pending for TLB flush in the zone
+ */
+ atomic_t migrc_pending_nr;
+#endif
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0317e7a65151..d8c285309a8f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1330,6 +1330,14 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_ro;
+#if defined(CONFIG_MIGRATION) && defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ /*
+ * whether all the mappings of a folio during unmap are read-only
+ * so that migrc can work on the folio
+ */
+ bool can_migrc;
+#endif
+
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index 3be8fd5604e8..ab02cb8306e2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1284,4 +1284,57 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
}
#endif /* CONFIG_SHRINKER_DEBUG */
+#if defined(CONFIG_MIGRATION) && defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+/*
+ * Reset the indicator indicating there are no writable mappings at the
+ * beginning of every rmap traverse for unmap. Migrc can work only when
+ * all the mappings are read-only.
+ */
+static inline void can_migrc_init(void)
+{
+ current->can_migrc = true;
+}
+
+/*
+ * Mark that the folio is not applicable to migrc once a writable or
+ * dirty pte is found during the rmap traverse for unmap.
+ */
+static inline void can_migrc_fail(void)
+{
+ current->can_migrc = false;
+}
+
+/*
+ * Check if all the mappings are read-only and read-only mappings even
+ * exist.
+ */
+static inline bool can_migrc_test(void)
+{
+ return current->can_migrc && current->tlb_ubc_ro.flush_required;
+}
+
+/*
+ * Return the number of folios pending TLB flush that have yet to get
+ * freed in the zone.
+ */
+static inline int migrc_pending_nr_in_zone(struct zone *z)
+{
+ return atomic_read(&z->migrc_pending_nr);
+}
+
+/*
+ * Perform TLB flush needed and free the folios under migrc's control.
+ */
+bool migrc_flush_free_folios(void);
+void migrc_flush_start(void);
+void migrc_flush_end(struct tlbflush_unmap_batch *batch);
+#else /* CONFIG_MIGRATION && CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+static inline void can_migrc_init(void) {}
+static inline void can_migrc_fail(void) {}
+static inline bool can_migrc_test(void) { return false; }
+static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
+static inline bool migrc_flush_free_folios(void) { return false; }
+static inline void migrc_flush_start(void) {}
+static inline void migrc_flush_end(struct tlbflush_unmap_batch *batch) {}
+#endif
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 6e0712d06cd4..e67de161da8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3462,6 +3462,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (vmf->page)
folio = page_folio(vmf->page);
+ /*
+ * The folio may or may not be one that is under migrc's control
+ * and about to change its permission from read-only to writable.
+ * Conservatively give up deferring TLB flush just in case.
+ */
+ if (folio)
+ migrc_flush_free_folios();
+
/*
* Shared mapping: we are guaranteed to have VM_WRITE and
* FAULT_FLAG_WRITE set at this point.
diff --git a/mm/migrate.c b/mm/migrate.c
index bbe1ecef4956..cbe5372f159e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,194 @@
#include "internal.h"
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static struct tlbflush_unmap_batch migrc_ubc;
+static LIST_HEAD(migrc_folios);
+static DEFINE_SPINLOCK(migrc_lock);
+
+static void init_tlb_ubc(struct tlbflush_unmap_batch *ubc)
+{
+ arch_tlbbatch_clear(&ubc->arch);
+ ubc->flush_required = false;
+ ubc->writable = false;
+}
+
+static void migrc_keep_folio(struct folio *f, struct list_head *h)
+{
+ list_move_tail(&f->lru, h);
+ folio_get(f);
+ atomic_inc(&folio_zone(f)->migrc_pending_nr);
+}
+
+static void migrc_undo_folio(struct folio *f)
+{
+ list_del(&f->lru);
+ folio_put(f);
+ atomic_dec(&folio_zone(f)->migrc_pending_nr);
+}
+
+static void migrc_release_folio(struct folio *f)
+{
+ folio_put_small_nopcp(f);
+ atomic_dec(&folio_zone(f)->migrc_pending_nr);
+}
+
+/*
+ * Need to synchronize between TLB flush and managing pending CPUs in
+ * migrc_ubc. Take a look at the following scenario:
+ *
+ * CPU0 CPU1
+ * ---- ----
+ * TLB flush
+ * Unmap folios (needing TLB flush)
+ * Add pending CPUs to migrc_ubc
+ * Clear the CPUs from migrc_ubc
+ *
+ * The pending CPUs added in CPU1 should not be cleared from migrc_ubc
+ * in CPU0 because the TLB flush for migrc_ubc added in CPU1 has not
+ * been performed this turn. To avoid this, using 'migrc_flushing'
+ * variable, prevent adding pending CPUs to migrc_ubc and give up migrc
+ * mechanism if others are in the middle of TLB flush, like:
+ *
+ * CPU0 CPU1
+ * ---- ----
+ * migrc_flushing++
+ * TLB flush
+ * Unmap folios (needing TLB flush)
+ * If migrc_flushing == 0:
+ * Add pending CPUs to migrc_ubc
+ * Else: <--- hit
+ * Give up migrc mechanism
+ * Clear the CPUs from migrc_ubc
+ * migrc_flushing--
+ *
+ * Only the following case would be allowed for migrc mechanism to work:
+ *
+ * CPU0 CPU1
+ * ---- ----
+ * Unmap folios (needing TLB flush)
+ * If migrc_flushing == 0: <--- hit
+ * Add pending CPUs to migrc_ubc
+ * Else:
+ * Give up migrc mechanism
+ * migrc_flushing++
+ * TLB flush
+ * Clear the CPUs from migrc_ubc
+ * migrc_flushing--
+ */
+static int migrc_flushing;
+
+static bool migrc_add_pending_ubc(struct tlbflush_unmap_batch *ubc)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ unsigned long flags;
+
+ spin_lock_irqsave(&migrc_lock, flags);
+ if (migrc_flushing) {
+ spin_unlock_irqrestore(&migrc_lock, flags);
+
+ /*
+ * Give up migrc mechanism. Just let TLB flush needed
+ * handled by try_to_unmap_flush() at the caller side.
+ */
+ fold_ubc(tlb_ubc, ubc);
+ return false;
+ }
+ fold_ubc(&migrc_ubc, ubc);
+ spin_unlock_irqrestore(&migrc_lock, flags);
+ return true;
+}
+
+static bool migrc_add_pending_folios(struct list_head *folios)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&migrc_lock, flags);
+ if (migrc_flushing) {
+ spin_unlock_irqrestore(&migrc_lock, flags);
+
+ /*
+ * Give up migrc mechanism. The caller should perform
+ * TLB flush needed using migrc_flush_free_folios() and
+ * undo some on the folios e.g. restore folios'
+ * reference count increased by migrc and more.
+ */
+ return false;
+ }
+ list_splice(folios, &migrc_folios);
+ spin_unlock_irqrestore(&migrc_lock, flags);
+ return true;
+}
+
+void migrc_flush_start(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&migrc_lock, flags);
+ migrc_flushing++;
+ spin_unlock_irqrestore(&migrc_lock, flags);
+}
+
+void migrc_flush_end(struct tlbflush_unmap_batch *batch)
+{
+ LIST_HEAD(folios);
+ struct folio *f, *f2;
+ unsigned long flags;
+
+ spin_lock_irqsave(&migrc_lock, flags);
+ if (!arch_tlbbatch_done(&migrc_ubc.arch, &batch->arch)) {
+ list_splice_init(&migrc_folios, &folios);
+ migrc_ubc.flush_required = false;
+ migrc_ubc.writable = false;
+ }
+ migrc_flushing--;
+ spin_unlock_irqrestore(&migrc_lock, flags);
+
+ list_for_each_entry_safe(f, f2, &folios, lru)
+ migrc_release_folio(f);
+}
+
+bool migrc_flush_free_folios(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ LIST_HEAD(folios);
+ struct folio *f, *f2;
+ unsigned long flags;
+ bool ret = true;
+
+ spin_lock_irqsave(&migrc_lock, flags);
+ list_splice_init(&migrc_folios, &folios);
+ fold_ubc(tlb_ubc, &migrc_ubc);
+ spin_unlock_irqrestore(&migrc_lock, flags);
+
+ if (list_empty(&folios))
+ ret = false;
+
+ try_to_unmap_flush();
+ list_for_each_entry_safe(f, f2, &folios, lru)
+ migrc_release_folio(f);
+ return ret;
+}
+#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+static void init_tlb_ubc(struct tlbflush_unmap_batch *ubc)
+{
+}
+static void migrc_keep_folio(struct folio *f, struct list_head *h)
+{
+}
+static void migrc_undo_folio(struct folio *f)
+{
+}
+static bool migrc_add_pending_ubc(struct tlbflush_unmap_batch *ubc)
+{
+ return false;
+}
+static bool migrc_add_pending_folios(struct list_head *folios)
+{
+ return false;
+}
+#endif
+
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct folio *folio = folio_get_nontail_page(page);
@@ -1274,7 +1462,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio *dst,
enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
+ struct list_head *ret, struct list_head *move_succ)
{
int rc;
int old_page_state = 0;
@@ -1321,9 +1509,13 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
/*
* A folio that has been migrated has all references removed
- * and will be freed.
+ * and will be freed, unless it's under migrc's control.
*/
- list_del(&src->lru);
+ if (move_succ)
+ migrc_keep_folio(src, move_succ);
+ else
+ list_del(&src->lru);
+
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
@@ -1618,7 +1810,7 @@ static void migrate_folios_move(struct list_head *src_folios,
struct list_head *ret_folios,
struct migrate_pages_stats *stats,
int *retry, int *thp_retry, int *nr_failed,
- int *nr_retry_pages)
+ int *nr_retry_pages, struct list_head *move_succ)
{
struct folio *folio, *folio2, *dst, *dst2;
bool is_thp;
@@ -1635,7 +1827,7 @@ static void migrate_folios_move(struct list_head *src_folios,
rc = migrate_folio_move(put_new_folio, private,
folio, dst, mode,
- reason, ret_folios);
+ reason, ret_folios, move_succ);
/*
* The rules are:
* Success: folio will be freed
@@ -1712,17 +1904,34 @@ static int migrate_pages_batch(struct list_head *from,
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
+ LIST_HEAD(unmap_folios_migrc);
+ LIST_HEAD(dst_folios_migrc);
+ LIST_HEAD(move_succ);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ struct tlbflush_unmap_batch pending_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ bool do_migrc;
+ bool migrc_ubc_succ;
VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));
+ /*
+ * Apply migrc only to numa migration for now.
+ */
+ init_tlb_ubc(&pending_ubc);
+ do_migrc = IS_ENABLED(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH) &&
+ (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED);
+
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
thp_retry = 0;
nr_retry_pages = 0;
list_for_each_entry_safe(folio, folio2, from, lru) {
+ bool can_migrc;
+
is_large = folio_test_large(folio);
is_thp = is_large && folio_test_pmd_mappable(folio);
nr_pages = folio_nr_pages(folio);
@@ -1752,9 +1961,12 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ can_migrc_init();
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
ret_folios);
+ can_migrc = do_migrc && can_migrc_test() && !is_large;
+
/*
* The rules are:
* Success: folio will be freed
@@ -1800,7 +2012,8 @@ static int migrate_pages_batch(struct list_head *from,
/* nr_failed isn't updated for not used */
stats->nr_thp_failed += thp_retry;
rc_saved = rc;
- if (list_empty(&unmap_folios))
+ if (list_empty(&unmap_folios) &&
+ list_empty(&unmap_folios_migrc))
goto out;
else
goto move;
@@ -1814,8 +2027,19 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_succeeded += is_thp;
break;
case MIGRATEPAGE_UNMAP:
- list_move_tail(&folio->lru, &unmap_folios);
- list_add_tail(&dst->lru, &dst_folios);
+ if (can_migrc) {
+ list_move_tail(&folio->lru, &unmap_folios_migrc);
+ list_add_tail(&dst->lru, &dst_folios_migrc);
+
+ /*
+ * Gather ro batch data to add
+ * to migrc_ubc after unmap.
+ */
+ fold_ubc(&pending_ubc, tlb_ubc_ro);
+ } else {
+ list_move_tail(&folio->lru, &unmap_folios);
+ list_add_tail(&dst->lru, &dst_folios);
+ }
break;
default:
/*
@@ -1829,12 +2053,19 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_failed_pages += nr_pages;
break;
}
+ /*
+ * Done with the current folio. Fold the ro
+ * batch data gathered, to the normal batch.
+ */
+ fold_ubc(tlb_ubc, tlb_ubc_ro);
}
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /* Should be before try_to_unmap_flush() */
+ migrc_ubc_succ = do_migrc && migrc_add_pending_ubc(&pending_ubc);
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();
@@ -1848,7 +2079,27 @@ static int migrate_pages_batch(struct list_head *from,
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
- &nr_failed, &nr_retry_pages);
+ &nr_failed, &nr_retry_pages, NULL);
+ migrate_folios_move(&unmap_folios_migrc, &dst_folios_migrc,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages, migrc_ubc_succ ?
+ &move_succ : NULL);
+ }
+
+ /*
+ * Handle the case where migrc_add_pending_ubc() succeeded but
+ * migrc_add_pending_folios() did not.
+ */
+ if (migrc_ubc_succ && !migrc_add_pending_folios(&move_succ)) {
+ migrc_flush_free_folios();
+
+ /*
+ * Undo src folios that have been successfully added to
+ * move_succ.
+ */
+ list_for_each_entry_safe(folio, folio2, &move_succ, lru)
+ migrc_undo_folio(folio);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1859,6 +2110,8 @@ static int migrate_pages_batch(struct list_head *from,
/* Cleanup remaining folios */
migrate_folios_undo(&unmap_folios, &dst_folios,
put_new_folio, private, ret_folios);
+ migrate_folios_undo(&unmap_folios_migrc, &dst_folios_migrc,
+ put_new_folio, private, ret_folios);
return rc;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21b8c8cd1673..6ef0c22b1109 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2972,6 +2972,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
long min = mark;
int o;
+ free_pages += migrc_pending_nr_in_zone(z);
+
/* free_pages may go negative - that's OK */
free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
@@ -3066,7 +3068,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
long usable_free;
long reserved;
- usable_free = free_pages;
+ usable_free = free_pages + migrc_pending_nr_in_zone(z);
reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
/* reserved may over estimate high-atomic reserves. */
@@ -3273,6 +3275,13 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
gfp_mask)) {
int ret;
+ if (migrc_pending_nr_in_zone(zone) &&
+ migrc_flush_free_folios() &&
+ zone_watermark_fast(zone, order, mark,
+ ac->highest_zoneidx,
+ alloc_flags, gfp_mask))
+ goto try_this_zone;
+
if (has_unaccepted_memory()) {
if (try_to_accept_memory(zone, order))
goto try_this_zone;
diff --git a/mm/rmap.c b/mm/rmap.c
index b484d659d0c1..39ab0d64665a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -642,7 +642,9 @@ void try_to_unmap_flush(void)
if (!tlb_ubc->flush_required)
return;
+ migrc_flush_start();
arch_tlbbatch_flush(&tlb_ubc->arch);
+ migrc_flush_end(tlb_ubc);
arch_tlbbatch_clear(&tlb_ubc->arch);
tlb_ubc->flush_required = false;
tlb_ubc->writable = false;
@@ -677,9 +679,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!pte_accessible(mm, pteval))
return;
- if (pte_write(pteval) || writable)
+ if (pte_write(pteval) || writable) {
tlb_ubc = &current->tlb_ubc;
- else
+
+ /*
+ * Migrc cannot work with the folio, once it found a
+ * writable or dirty mapping on it.
+ */
+ can_migrc_fail();
+ } else
tlb_ubc = &current->tlb_ubc_ro;
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
--
2.17.1
Functionally, no change. This is a preparation for the migrc mechanism,
which needs to recognize read-only TLB entries and make use of them to
batch more aggressively. In addition, the newly introduced API,
fold_ubc(), will be used by the migrc mechanism when manipulating tlb
batch data.
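For example, a path that gathered read-only entries separately is
expected to merge them back into the main batch before flushing, roughly
like this (a minimal sketch; it mirrors the try_to_unmap_flush() change
below):

    struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
    struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;

    /*
     * Merge the read-only batch into the main one; fold_ubc() also
     * resets the source batch (tlb_ubc_ro).
     */
    fold_ubc(tlb_ubc, tlb_ubc_ro);
    if (tlb_ubc->flush_required)
            arch_tlbbatch_flush(&tlb_ubc->arch);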
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/sched.h | 1 +
mm/internal.h | 4 ++++
mm/rmap.c | 31 ++++++++++++++++++++++++++++++-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..0317e7a65151 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1328,6 +1328,7 @@ struct task_struct {
#endif
struct tlbflush_unmap_batch tlb_ubc;
+ struct tlbflush_unmap_batch tlb_ubc_ro;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..b880f1e78700 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -923,6 +923,7 @@ extern struct workqueue_struct *mm_percpu_wq;
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_ubc(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src);
#else
static inline void try_to_unmap_flush(void)
{
@@ -933,6 +934,9 @@ static inline void try_to_unmap_flush_dirty(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void fold_ubc(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index 7a27a2b41802..da36f23ff7b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -605,6 +605,28 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
}
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+void fold_ubc(struct tlbflush_unmap_batch *dst,
+ struct tlbflush_unmap_batch *src)
+{
+ if (!src->flush_required)
+ return;
+
+ /*
+ * Fold src to dst.
+ */
+ arch_tlbbatch_fold(&dst->arch, &src->arch);
+ dst->writable = dst->writable || src->writable;
+ dst->flush_required = true;
+
+ /*
+ * Reset src.
+ */
+ arch_tlbbatch_clear(&src->arch);
+ src->flush_required = false;
+ src->writable = false;
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -614,7 +636,9 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ fold_ubc(tlb_ubc, tlb_ubc_ro);
if (!tlb_ubc->flush_required)
return;
@@ -645,13 +669,18 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr)
{
- struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc;
int batch;
bool writable = pte_dirty(pteval);
if (!pte_accessible(mm, pteval))
return;
+ if (pte_write(pteval) || writable)
+ tlb_ubc = &current->tlb_ubc;
+ else
+ tlb_ubc = &current->tlb_ubc_ro;
+
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
--
2.17.1
This is a preparation for the migrc mechanism, which needs to recognize
read-only TLB entries during batched TLB flush by splitting the tlb
batch's arch data into two parts, one for read-only entries and the
other for writable ones, and merging the two when needed.
Migrc also needs to optimize which CPUs to flush by clearing the ones
that have already performed the needed TLB flush.
To support this, add APIs for manipulating the arch data on x86.
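Roughly, the intended semantics of the new helpers on x86 are as follows
(an illustrative summary, not part of the diff):

    struct arch_tlbflush_unmap_batch a, b;

    arch_tlbbatch_fold(&a, &b);     /* a.cpumask |= b.cpumask */
    arch_tlbbatch_clear(&a);        /* empty a.cpumask */
    arch_tlbbatch_done(&a, &b);     /* a.cpumask &= ~b.cpumask; returns
                                     * true if CPUs are still pending in
                                     * 'a' after removing the ones already
                                     * covered by 'b' */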
Signed-off-by: Byungchul Park <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4..fa7e16dbeb44 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -5,6 +5,7 @@
#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
@@ -293,6 +294,23 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
--
2.17.1
This is a preparation for the migrc mechanism, which frees folios at a
better time. The mechanism defers the folio_put*() calls for the source
folios of a migration, which are unlikely to be used again, and frees a
group of folios at once at a later time.
However, freeing them through pcp would pollute it, unexpectedly causing
free_pcppages_bulk() to release fresher folios and making pcp unstable.
To facilitate the new mechanism, add an API that frees folios under
migrc's control directly to the buddy allocator, bypassing pcp.
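Usage is expected to mirror folio_put(); e.g. (a minimal sketch):

    /*
     * Drop the last reference on a kept single-page source folio and
     * return it straight to the buddy allocator, bypassing pcp.
     */
    folio_put_small_nopcp(folio);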
Signed-off-by: Byungchul Park <[email protected]>
---
include/linux/mm.h | 23 +++++++++++++++++++++++
mm/internal.h | 1 +
mm/page_alloc.c | 10 ++++++++++
mm/swap.c | 7 +++++++
4 files changed, 41 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da5219b48d52..fc0581cce3a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1284,6 +1284,7 @@ static inline struct folio *virt_to_folio(const void *x)
}
void __folio_put(struct folio *folio);
+void __folio_put_small_nopcp(struct folio *folio);
void put_pages_list(struct list_head *pages);
@@ -1483,6 +1484,28 @@ static inline void folio_put(struct folio *folio)
__folio_put(folio);
}
+/**
+ * folio_put_small_nopcp - Decrement the reference count on a folio.
+ * @folio: The folio.
+ *
+ * This is only for a single page folio to release directly to the buddy
+ * allocator bypassing pcp.
+ *
+ * If the folio's reference count reaches zero, the memory will be
+ * released back to the page allocator and may be used by another
+ * allocation immediately. Do not access the memory or the struct folio
+ * after calling folio_put_small_nopcp() unless you can be sure that it
+ * wasn't the last reference.
+ *
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context. May be called while holding a spinlock.
+ */
+static inline void folio_put_small_nopcp(struct folio *folio)
+{
+ if (folio_put_testzero(folio))
+ __folio_put_small_nopcp(folio);
+}
+
/**
* folio_put_refs - Reduce the reference count on a folio.
* @folio: The folio.
diff --git a/mm/internal.h b/mm/internal.h
index b880f1e78700..3be8fd5604e8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -451,6 +451,7 @@ extern int user_min_free_kbytes;
extern void free_unref_page(struct page *page, unsigned int order);
extern void free_unref_page_list(struct list_head *list);
+extern void free_pages_nopcp(struct page *page, unsigned int order);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 733732e7e0ba..21b8c8cd1673 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -565,6 +565,16 @@ static inline void free_the_page(struct page *page, unsigned int order)
__free_pages_ok(page, order, FPI_NONE);
}
+void free_pages_nopcp(struct page *page, unsigned int order)
+{
+ /*
+ * This function will be used in case that the pages are too
+ * cold to keep in pcp e.g. migrc mechanism. So it'd better
+ * release the pages to the tail.
+ */
+ __free_pages_ok(page, order, FPI_TO_TAIL);
+}
+
/*
* Higher-order pages are called "compound pages". They are structured thusly:
*
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..3f37496a1184 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -106,6 +106,13 @@ static void __folio_put_small(struct folio *folio)
free_unref_page(&folio->page, 0);
}
+void __folio_put_small_nopcp(struct folio *folio)
+{
+ __page_cache_release(folio);
+ mem_cgroup_uncharge(folio);
+ free_pages_nopcp(&folio->page, 0);
+}
+
static void __folio_put_large(struct folio *folio)
{
/*
--
2.17.1
On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
> Hi everyone,
>
> While I'm working with a tiered memory system e.g. CXL memory, I have
> been facing migration overhead esp. TLB shootdown on promotion or
> demotion between different tiers. Yeah.. most TLB shootdowns on
> migration through hinting fault can be avoided thanks to Huang Ying's
> work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
> is inaccessible"). See the following link:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> However, it's only for ones using hinting fault. I thought it'd be much
> better if we have a general mechanism to reduce the number of TLB
> flushes and TLB misses, that we can ultimately apply to any type of
> migration, I tried it only for tiering for now tho.
>
> I'm suggesting a mechanism called MIGRC that stands for 'Migration Read
> Copy', to reduce TLB flushes by keeping source and destination of folios
> participated in the migrations until all TLB flushes required are done,
> only if those folios are not mapped with write permission PTE entries.
>
> To achieve that:
>
> 1. For the folios that map only to non-writable TLB entries, prevent
> TLB flush at migration by keeping both source and destination
> folios, which will be handled later at a better time.
>
> 2. When any non-writable TLB entry changes to writable e.g. through
> fault handler, give up migrc mechanism so as to perform TLB flush
> required right away.
>
> I observed a big improvement of TLB flushes # and TLB misses # at the
> following evaluation using XSBench like:
>
> 1. itlb flush was reduced by 93.9%.
> 2. dtlb thread was reduced by 43.5%.
> 3. stlb flush was reduced by 24.9%.
Hi guys,
The TLB flush reduction is 25% ~ 94%, which is, IMO, a remarkable
improvement.
While modern computer architectures are typically capable of handling a
large number of TLB flush hardware requests without significant
performance degradation, it's still important to minimize the number of
unnecessary hardware events whenever possible.
The impact of excessive TLB flushes on system performance can vary
depending on factors such as the amount of TLB miss overhead your
particular system experiences. Nevertheless, reducing the frequency of
TLB flushes can contribute to greater overall system stability and
performance.
I'm convinced this mechanism could help your systems operate better with
far fewer TLB flushes and misses.
Byungchul
> 4. dtlb store misses was reduced by 34.2%.
> 5. itlb load misses was reduced by 45.5%.
> 6. The runtime was reduced by 3.5%.
>
> I believe that it would help more with any real cases.
>
> ---
>
> The measurement result:
>
> Architecture - x86_64
> QEMU - kvm enabled, host cpu
> Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> Linux Kernel - v6.7, numa balancing tiering on, demotion enabled
> Benchmark - XSBench -p 100000000 (-p option makes the runtime longer)
>
> run 'perf stat' using events:
> 1) itlb.itlb_flush
> 2) tlb_flush.dtlb_thread
> 3) tlb_flush.stlb_any
> 4) dTLB-load-misses
> 5) dTLB-store-misses
> 6) iTLB-load-misses
>
> run 'cat /proc/vmstat' and pick:
> 1) numa_pages_migrated
> 2) pgmigrate_success
> 3) nr_tlb_remote_flush
> 4) nr_tlb_remote_flush_received
> 5) nr_tlb_local_flush_all
> 6) nr_tlb_local_flush_one
>
> BEFORE - mainline v6.7
> ----------------------
> $ perf stat -a \
> -e itlb.itlb_flush \
> -e tlb_flush.dtlb_thread \
> -e tlb_flush.stlb_any \
> -e dTLB-load-misses \
> -e dTLB-store-misses \
> -e iTLB-load-misses \
> ./XSBench -p 100000000
>
> Performance counter stats for 'system wide':
>
> 85647229 itlb.itlb_flush
> 480981504 tlb_flush.dtlb_thread
> 323937200 tlb_flush.stlb_any
> 238381632579 dTLB-load-misses
> 601514255 dTLB-store-misses
> 2974157461 iTLB-load-misses
>
> 2252.883892112 seconds time elapsed
>
> $ cat /proc/vmstat
>
> ...
> numa_pages_migrated 12790664
> pgmigrate_success 26835314
> nr_tlb_remote_flush 3031412
> nr_tlb_remote_flush_received 45234862
> nr_tlb_local_flush_all 216584
> nr_tlb_local_flush_one 740940
> ...
>
> AFTER - mainline v6.7 + migrc
> -----------------------------
> $ perf stat -a \
> -e itlb.itlb_flush \
> -e tlb_flush.dtlb_thread \
> -e tlb_flush.stlb_any \
> -e dTLB-load-misses \
> -e dTLB-store-misses \
> -e iTLB-load-misses \
> ./XSBench -p 100000000
>
> Performance counter stats for 'system wide':
>
> 5240261 itlb.itlb_flush
> 271581774 tlb_flush.dtlb_thread
> 243149389 tlb_flush.stlb_any
> 234502983364 dTLB-load-misses
> 395673680 dTLB-store-misses
> 1620215163 iTLB-load-misses
>
> 2172.283436287 seconds time elapsed
>
> $ cat /proc/vmstat
>
> ...
> numa_pages_migrated 14897064
> pgmigrate_success 30825530
> nr_tlb_remote_flush 198290
> nr_tlb_remote_flush_received 2820156
> nr_tlb_local_flush_all 92048
> nr_tlb_local_flush_one 741401
> ...
>
> ---
>
> Changes from v7:
> 1. Rewrite cover letter to explain what 'migrc' mechasism is.
> (feedbacked by Andrew Morton)
> 2. Supplement the commit message of a patch 'mm: Add APIs to
> free a folio directly to the buddy bypassing pcp'.
> (feedbacked by Andrew Morton)
>
> Changes from v6:
> 1. Fix build errors in case of
> CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH disabled by moving
> migrc_flush_{start,end}() calls from arch code to
> try_to_unmap_flush() in mm/rmap.c.
>
> Changes from v5:
> 1. Fix build errors in case of CONFIG_MIGRATION disabled or
> CONFIG_HWPOISON_INJECT moduled. (feedbacked by kernel test
> bot and Raymond Jay Golo)
> 2. Organize migrc code with two kconfigs, CONFIG_MIGRATION and
> CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>
> Changes from v4:
>
> 1. Rebase on v6.7.
> 2. Fix build errors in arm64 that is doing nothing for TLB flush
> but has CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. (reported
> by kernel test robot)
> 3. Don't use any page flag. So the system would give up migrc
> mechanism more often but it's okay. The final improvement is
> good enough.
> 4. Instead, optimize full TLB flush(arch_tlbbatch_flush()) by
> avoiding redundant CPUs from TLB flush.
>
> Changes from v3:
>
> 1. Don't use the kconfig, CONFIG_MIGRC, and remove sysctl knob,
> migrc_enable. (feedbacked by Nadav)
> 2. Remove the optimization skipping CPUs that have already
> performed TLB flushes needed by any reason when performing
> TLB flushes by migrc because I can't tell the performance
> difference between w/ the optimization and w/o that.
> (feedbacked by Nadav)
> 3. Minimize arch-specific code. While at it, move all the migrc
> declarations and inline functions from include/linux/mm.h to
> mm/internal.h (feedbacked by Dave Hansen, Nadav)
> 4. Separate a part making migrc paused when the system is in
> high memory pressure to another patch. (feedbacked by Nadav)
> 5. Rename:
> a. arch_tlbbatch_clean() to arch_tlbbatch_clear(),
> b. tlb_ubc_nowr to tlb_ubc_ro,
> c. migrc_try_flush_free_folios() to migrc_flush_free_folios(),
> d. migrc_stop to migrc_pause.
> (feedbacked by Nadav)
> 6. Use ->lru list_head instead of introducing a new llist_head.
> (feedbacked by Nadav)
> 7. Use non-atomic operations of page-flag when it's safe.
> (feedbacked by Nadav)
> 8. Use stack instead of keeping a pointer of 'struct migrc_req'
> in struct task, which is for manipulating it locally.
> (feedbacked by Nadav)
> 9. Replace a lot of simple functions to inline functions placed
> in a header, mm/internal.h. (feedbacked by Nadav)
> 10. Add additional sufficient comments. (feedbacked by Nadav)
> 11. Remove a lot of wrapper functions. (feedbacked by Nadav)
>
> Changes from RFC v2:
>
> 1. Remove additional occupation in struct page. To do that,
> unioned with lru field for migrc's list and added a page
> flag. I know page flag is a thing that we don't like to add
> but no choice because migrc should distinguish folios under
> migrc's control from others. Instead, I force migrc to be
> used only on 64 bit system to mitigate you guys from getting
> angry.
> 2. Remove meaningless internal object allocator that I
> introduced to minimize impact onto the system. However, a ton
> of tests showed there was no difference.
> 3. Stop migrc from working when the system is in high memory
> pressure like about to perform direct reclaim. At the
> condition where the swap mechanism is heavily used, I found
> the system suffered from regression without this control.
> 4. Exclude folios that pte_dirty() == true from migrc's interest
> so that migrc can work simpler.
> 5. Combine several patches that work tightly coupled to one.
> 6. Add sufficient comments for better review.
> 7. Manage migrc's request in per-node manner (from globally).
> 8. Add TLB miss improvement in commit message.
> 9. Test with more CPUs(4 -> 16) to see bigger improvement.
>
> Changes from RFC:
>
> 1. Fix a bug triggered when a destination folio of a previous
> migration becomes a source folio of the next migration before
> the folio has been handled properly, which left the folio's
> state inconsistent.
> 2. Split the patch set into more pieces so that the folks can
> review better. (Feedbacked by Nadav Amit)
> 3. Fix a wrong usage of barrier e.g. smp_mb__after_atomic().
> (Feedbacked by Nadav Amit)
> 4. Tried to add sufficient comments to explain the patch set
> better. (Feedbacked by Nadav Amit)
>
> Byungchul Park (8):
> x86/tlb: Add APIs manipulating tlb batch's arch data
> arm64: tlbflush: Add APIs manipulating tlb batch's arch data
> mm/rmap: Recognize read-only TLB entries during batched TLB flush
> x86/tlb, mm/rmap: Separate arch_tlbbatch_clear() out of
> arch_tlbbatch_flush()
> mm: Separate move/undo doing on folio list from migrate_pages_batch()
> mm: Add APIs to free a folio directly to the buddy bypassing pcp
> mm: Defer TLB flush by keeping both src and dst folios at migration
> mm: Pause migrc mechanism at high memory pressure
>
> arch/arm64/include/asm/tlbflush.h | 19 ++
> arch/x86/include/asm/tlbflush.h | 18 ++
> arch/x86/mm/tlb.c | 2 -
> include/linux/mm.h | 23 ++
> include/linux/mmzone.h | 7 +
> include/linux/sched.h | 9 +
> mm/internal.h | 78 ++++++
> mm/memory.c | 8 +
> mm/migrate.c | 411 ++++++++++++++++++++++++++----
> mm/page_alloc.c | 34 ++-
> mm/rmap.c | 40 ++-
> mm/swap.c | 7 +
> 12 files changed, 597 insertions(+), 59 deletions(-)
>
>
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
> --
> 2.17.1
On 29.02.24 10:28, Byungchul Park wrote:
> On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
>> Hi everyone,
>>
>> While I'm working with a tiered memory system e.g. CXL memory, I have
>> been facing migration overhead esp. TLB shootdown on promotion or
>> demotion between different tiers. Yeah.. most TLB shootdowns on
>> migration through hinting fault can be avoided thanks to Huang Ying's
>> work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
>> is inaccessible"). See the following link:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> However, it's only for ones using hinting fault. I thought it'd be much
>> better if we have a general mechanism to reduce the number of TLB
>> flushes and TLB misses, that we can ultimately apply to any type of
>> migration, I tried it only for tiering for now tho.
>>
>> I'm suggesting a mechanism called MIGRC that stands for 'Migration Read
>> Copy', to reduce TLB flushes by keeping source and destination of folios
>> participated in the migrations until all TLB flushes required are done,
>> only if those folios are not mapped with write permission PTE entries.
>>
>> To achieve that:
>>
>> 1. For the folios that map only to non-writable TLB entries, prevent
>> TLB flush at migration by keeping both source and destination
>> folios, which will be handled later at a better time.
>>
>> 2. When any non-writable TLB entry changes to writable e.g. through
>> fault handler, give up migrc mechanism so as to perform TLB flush
>> required right away.
>>
>> I observed a big improvement of TLB flushes # and TLB misses # at the
>> following evaluation using XSBench like:
>>
>> 1. itlb flush was reduced by 93.9%.
>> 2. dtlb thread was reduced by 43.5%.
>> 3. stlb flush was reduced by 24.9%.
>
> Hi guys,
Hi,
>
> The TLB flush reduction is 25% ~ 94%, IMO, it's unbelievable.
Can't we find at least one benchmark that shows an actual improvement on
some system?
Staring at the number of TLB flushes is nice, but if it does not affect
actual performance of at least one benchmark, why do we even care?
"12 files changed, 597 insertions(+), 59 deletions(-)"
is not negligible and needs proper review.
That review needs motivation. The current numbers do not seem to be
motivating enough :)
--
Cheers,
David / dhildenb
David Hildenbrand <[email protected]> writes:
> On 29.02.24 10:28, Byungchul Park wrote:
>> On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
>>> Hi everyone,
>>>
>>> While I'm working with a tiered memory system e.g. CXL memory, I have
>>> been facing migration overhead esp. TLB shootdown on promotion or
>>> demotion between different tiers. Yeah.. most TLB shootdowns on
>>> migration through hinting fault can be avoided thanks to Huang Ying's
>>> work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
>>> is inaccessible"). See the following link:
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> However, it's only for ones using hinting fault. I thought it'd be much
>>> better if we have a general mechanism to reduce the number of TLB
>>> flushes and TLB misses, that we can ultimately apply to any type of
>>> migration, I tried it only for tiering for now tho.
>>>
>>> I'm suggesting a mechanism called MIGRC that stands for 'Migration Read
>>> Copy', to reduce TLB flushes by keeping source and destination of folios
>>> participated in the migrations until all TLB flushes required are done,
>>> only if those folios are not mapped with write permission PTE entries.
>>>
>>> To achieve that:
>>>
>>> 1. For the folios that map only to non-writable TLB entries, prevent
>>> TLB flush at migration by keeping both source and destination
>>> folios, which will be handled later at a better time.
>>>
>>> 2. When any non-writable TLB entry changes to writable e.g. through
>>> fault handler, give up migrc mechanism so as to perform TLB flush
>>> required right away.
>>>
>>> I observed a big improvement of TLB flushes # and TLB misses # at the
>>> following evaluation using XSBench like:
>>>
>>> 1. itlb flush was reduced by 93.9%.
>>> 2. dtlb thread was reduced by 43.5%.
>>> 3. stlb flush was reduced by 24.9%.
>> Hi guys,
>
> Hi,
>
>> The TLB flush reduction is 25% ~ 94%, IMO, it's unbelievable.
>
> Can't we find at least one benchmark that shows an actual improvement
> on some system?
>
> Staring at the number of TLB flushes is nice, but if it does not affect
> actual performance of at least one benchmark, why do we even care?
>
> "12 files changed, 597 insertions(+), 59 deletions(-)"
>
> is not negligible and needs proper review.
And the TLB flushes are reduced at the cost of memory wastage: the old
pages could otherwise have been freed. That may cause regressions for
some workloads.
> That review needs motivation. The current numbers do not seem to be
> motivating enough :)
--
Best Regards,
Huang, Ying
On Thu, Feb 29, 2024 at 10:33:44AM +0100, David Hildenbrand wrote:
> On 29.02.24 10:28, Byungchul Park wrote:
> > On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
> > > Hi everyone,
> > >
> > > While I'm working with a tiered memory system e.g. CXL memory, I have
> > > been facing migration overhead esp. TLB shootdown on promotion or
> > > demotion between different tiers. Yeah.. most TLB shootdowns on
> > > migration through hinting fault can be avoided thanks to Huang Ying's
> > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
> > > is inaccessible"). See the following link:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > However, it's only for ones using hinting fault. I thought it'd be much
> > > better if we have a general mechanism to reduce the number of TLB
> > > flushes and TLB misses, that we can ultimately apply to any type of
> > > migration, I tried it only for tiering for now tho.
> > >
> > > I'm suggesting a mechanism called MIGRC that stands for 'Migration Read
> > > Copy', to reduce TLB flushes by keeping source and destination of folios
> > > participated in the migrations until all TLB flushes required are done,
> > > only if those folios are not mapped with write permission PTE entries.
> > >
> > > To achieve that:
> > >
> > > 1. For the folios that map only to non-writable TLB entries, prevent
> > > TLB flush at migration by keeping both source and destination
> > > folios, which will be handled later at a better time.
> > >
> > > 2. When any non-writable TLB entry changes to writable e.g. through
> > > fault handler, give up migrc mechanism so as to perform TLB flush
> > > required right away.
> > >
> > > I observed a big improvement of TLB flushes # and TLB misses # at the
> > > following evaluation using XSBench like:
> > >
> > > 1. itlb flush was reduced by 93.9%.
> > > 2. dtlb thread was reduced by 43.5%.
> > > 3. stlb flush was reduced by 24.9%.
> >
> > Hi guys,
>
> Hi,
>
> >
> > The TLB flush reduction is 25% ~ 94%, IMO, it's unbelievable.
>
> Can't we find at least one benchmark that shows an actual improvement on
> some system?
XSBench is closer to a real workload, used for performance analysis on
high performance computing architectures; it's not a microbenchmark
built only to exercise the TLB.
XSBench : https://github.com/ANL-CESAR/XSBench
Leaving the TLB numbers aside, the runtime improvement is small but
clearly positive, as you can see in the result I shared.
Byungchul
> Staring at the number of TLB flushes is nice, but if it does not affect
> actual performance of at least one benchmark, why do we even care?
>
> "12 files changed, 597 insertions(+), 59 deletions(-)"
>
> is not negligible and needs proper review.
>
> That review needs motivation. The current numbers do not seem to be
> motivating enough :)
>
> --
> Cheers,
>
> David / dhildenb
On Fri, Mar 01, 2024 at 08:33:11AM +0800, Huang, Ying wrote:
> David Hildenbrand <[email protected]> writes:
>
> > On 29.02.24 10:28, Byungchul Park wrote:
> >> On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
> >>> Hi everyone,
> >>>
> >>> While I'm working with a tiered memory system e.g. CXL memory, I have
> >>> been facing migration overhead esp. TLB shootdown on promotion or
> >>> demotion between different tiers. Yeah.. most TLB shootdowns on
> >>> migration through hinting fault can be avoided thanks to Huang Ying's
> >>> work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
> >>> is inaccessible"). See the following link:
> >>>
> >>> https://lore.kernel.org/lkml/[email protected]/
> >>>
> >>> However, it's only for ones using hinting fault. I thought it'd be much
> >>> better if we have a general mechanism to reduce the number of TLB
> >>> flushes and TLB misses, that we can ultimately apply to any type of
> >>> migration, I tried it only for tiering for now tho.
> >>>
> >>> I'm suggesting a mechanism called MIGRC that stands for 'Migration Read
> >>> Copy', to reduce TLB flushes by keeping source and destination of folios
> >>> participated in the migrations until all TLB flushes required are done,
> >>> only if those folios are not mapped with write permission PTE entries.
> >>>
> >>> To achieve that:
> >>>
> >>> 1. For the folios that map only to non-writable TLB entries, prevent
> >>> TLB flush at migration by keeping both source and destination
> >>> folios, which will be handled later at a better time.
> >>>
> >>> 2. When any non-writable TLB entry changes to writable e.g. through
> >>> fault handler, give up migrc mechanism so as to perform TLB flush
> >>> required right away.
> >>>
> >>> I observed a big improvement of TLB flushes # and TLB misses # at the
> >>> following evaluation using XSBench like:
> >>>
> >>> 1. itlb flush was reduced by 93.9%.
> >>> 2. dtlb thread was reduced by 43.5%.
> >>> 3. stlb flush was reduced by 24.9%.
> >> Hi guys,
> >
> > Hi,
> >
> >> The TLB flush reduction is 25% ~ 94%, IMO, it's unbelievable.
> >
> > Can't we find at least one benchmark that shows an actual improvement
> > on some system?
> >
> > Staring at the number of TLB flushes is nice, but if it does not affect
> > actual performance of at least one benchmark, why do we even care?
> >
> > "12 files changed, 597 insertions(+), 59 deletions(-)"
> >
> > is not negligible and needs proper review.
>
> And, the TLB flush is reduced at cost of memory wastage. The old pages
> could have been freed. That may cause regression for some workloads.
You've got the key point of migrc (migration read copy) :) Yeah, the
most important thing to deal with is that 'memory wastage'. The pages
whose freeing is deferred for the optimization can still be freed at
any time once memory is actually needed, at the cost of performing the
TLB flush that would already have been done at migration time without
the migrc mechanism. See the sketch below.
So the memory wastage can be removed entirely once a few technical
issues are resolved, and I might need your help with those :)
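Simplified sketch only: the 'req' fields are hypothetical, and the real
series frees directly to the buddy (bypassing pcp) rather than via
folio_put(). migrc_flush_free_folios() is the name mentioned in the v3
changelog above.

	/*
	 * Under memory pressure, a deferred request can be resolved at
	 * any time: perform the TLB flush that was skipped at migration
	 * time, after which the kept source folios are safe to free.
	 */
	static void migrc_flush_free_folios(struct migrc_req *req)
	{
		struct folio *folio, *tmp;

		arch_tlbbatch_flush(&req->arch);	/* the flush we deferred earlier */

		list_for_each_entry_safe(folio, tmp, &req->folios, lru) {
			list_del(&folio->lru);
			folio_put(folio);		/* no stale TLB entries remain */
		}
	}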
Byungchul
> > That review needs motivation. The current numbers do not seem to be
> > motivating enough :)
>
> --
> Best Regards,
> Huang, Ying