2023-10-30 07:41:16

by Byungchul Park

Subject: [v3 0/3] Reduce TLB flushes under some specific conditions

To Huang Ying,

I tried to apply migrc to the swap path. Fortunately I couldn't see any
regression, but there was no performance improvement either. It seems
meaningless to make the code bigger without an observable benefit, so I
won't include that part. Thoughts?

To Nadav Amit,

I tried to split this patch set into as many pieces as possible for
better review. However, it was very hard to make each piece work
meaningfully and stably on its own because the parts are tightly coupled
to one another, so I ended up combining them into one patch again.
Instead, I did my best to add sufficient comments to the code. Any
opinion would be appreciated.

Hi everyone,

While working with CXL memory, I have been facing migration overhead,
especially TLB shootdown on promotion or demotion between different
tiers. Most TLB shootdowns on migration through the hinting fault path
can already be avoided thanks to Huang Ying's work, commit 4d4b6d66db
("mm,unmap: avoid flushing TLB in batch if PTE is inaccessible").

However, that covers only migrations triggered by hinting faults. It
would be much better to have a general mechanism that reduces the number
of TLB flushes and TLB misses and that can be applied to any type of
migration. For now, though, I have tried it only for tiering migration.

I'm suggesting a mechanism that reduces TLB flushes by keeping both the
source and destination folios participating in a migration until all the
required TLB flushes have been done, but only if those folios are not
mapped by any PTE with write permission. The series is based on
v6.6-rc5.
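
The core condition, roughly, looks like the sketch below. This is only
an illustration with made-up names; the actual check lives in
migrate_pages_batch() in patch 2 of this series.

/*
 * Illustrative sketch, not the code in the patches: a migration may
 * defer its TLB flush when unmapping the source folio added nothing
 * to the normal (writable) batch and added at least one entry to the
 * read-only batch.
 */
static bool can_defer_tlb_flush(int nr_writable_before,
                                int nr_writable_after, int nr_readonly)
{
        /* No writable PTE was batched while unmapping this folio... */
        if (nr_writable_after != nr_writable_before)
                return false;

        /* ...and at least one read-only PTE was. */
        return nr_readonly > 0;
}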

With the workload I tested, XSBench, I saw the number of full TLB
flushes reduced by about 80%, iTLB misses reduced by about 50%, and a
small performance improvement. I believe it would help even more with
other workloads, or with real ones. It'd be appreciated if you let me
know if I'm missing something.

Byungchul

---

Changes from RFC v2:
1. Remove the additional space consumption in struct page. To do that,
union migrc's list node with the lru field and add a page flag. I
know a new page flag is something we don't like to add, but there was
no other choice because migrc has to distinguish folios under its
control from the rest. To limit the impact, migrc is restricted to
64-bit systems.

2. Remove the internal object allocator that I had introduced to
minimize the impact on the system; a ton of tests showed it made no
difference.

3. Stop migrc from working when the system is under high memory
pressure, e.g. about to perform direct reclaim. When the swap
mechanism was heavily used, the system regressed without this
control.

4. Exclude folios with pte_dirty() == true from migrc's interest so
that migrc can stay simpler.

5. Combine several tightly coupled patches into one.

6. Add sufficient comments for better review.

7. Manage migrc's requests per node (previously globally).

8. Add the TLB miss improvement to the commit message.

9. Test with more CPUs (4 -> 16) to see a bigger improvement.

Changes from RFC:

1. Fix a bug triggered when a destination folio of a previous
migration becomes a source folio of the next migration before the
folio has been handled properly, which left the folio's state
inconsistent.

2. Split the patch set into more pieces so that folks can review it
better. (Feedback from Nadav Amit)

3. Fix an incorrect use of a barrier, e.g. smp_mb__after_atomic().
(Feedback from Nadav Amit)

4. Try to add sufficient comments to explain the patch set better.
(Feedback from Nadav Amit)

Byungchul Park (3):
mm/rmap: Recognize non-writable TLB entries during TLB batch flush
mm: Defer TLB flush by keeping both src and dst folios at migration
mm, migrc: Add a sysctl knob to enable/disable MIGRC mechanism

arch/x86/include/asm/tlbflush.h | 9 +
arch/x86/mm/tlb.c | 98 ++++++++++
include/linux/mm.h | 25 +++
include/linux/mm_types.h | 49 +++++
include/linux/mm_types_task.h | 4 +-
include/linux/mmzone.h | 7 +
include/linux/page-flags.h | 7 +
include/linux/sched.h | 9 +
include/trace/events/mmflags.h | 9 +-
init/Kconfig | 14 ++
mm/internal.h | 14 ++
mm/memory.c | 13 ++
mm/migrate.c | 310 ++++++++++++++++++++++++++++++++
mm/page_alloc.c | 29 ++-
mm/rmap.c | 115 +++++++++++-
15 files changed, 703 insertions(+), 9 deletions(-)

--
2.17.1


2023-10-30 07:41:44

by Byungchul Park

Subject: [v3 1/3] mm/rmap: Recognize non-writable TLB entries during TLB batch flush

No functional change. This is a preparation for CONFIG_MIGRC, which
needs to recognize non-writable TLB entries and make use of them to
batch more aggressively or even skip TLB flushes.

While at it, change struct tlbflush_unmap_batch's ->flush_required
(bool) to ->nr_flush_required (int) in order to track not only whether
a flush has been requested but also the exact number of requests. That
will be used by the CONFIG_MIGRC implementation.

Signed-off-by: Byungchul Park <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 7 ++++++
include/linux/mm_types_task.h | 4 ++--
include/linux/sched.h | 1 +
mm/internal.h | 14 ++++++++++++
mm/rmap.c | 39 ++++++++++++++++++++++++++++-----
6 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4..9d1361d2119c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -292,6 +292,8 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc);

static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 453ea95b667d..314cd9912a88 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1274,6 +1274,13 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
put_cpu();
}

+void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+ cpumask_clear(&bsrc->cpumask);
+}
+
/*
* Blindly accessing user memory from NMI context can be dangerous
* if we're in the middle of switching the current user task or
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index aa44fff8bb9d..35ba9425d48d 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -59,8 +59,8 @@ struct tlbflush_unmap_batch {
*/
struct arch_tlbflush_unmap_batch arch;

- /* True if a flush is needed. */
- bool flush_required;
+ /* The number of flush requested. */
+ int nr_flush_required;

/*
* If true then the PTE was dirty when unmapped. The entry must be
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..63189c023357 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1324,6 +1324,7 @@ struct task_struct {
#endif

struct tlbflush_unmap_batch tlb_ubc;
+ struct tlbflush_unmap_batch tlb_ubc_nowr;

/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index 30cf724ddbce..e20d5b223901 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,6 +861,9 @@ extern struct workqueue_struct *mm_percpu_wq;
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_ubc_nowr(void);
+int nr_flush_required(void);
+int nr_flush_required_nowr(void);
#else
static inline void try_to_unmap_flush(void)
{
@@ -871,6 +874,17 @@ static inline void try_to_unmap_flush_dirty(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void fold_ubc_nowr(void)
+{
+}
+static inline int nr_flush_required(void)
+{
+ return 0;
+}
+static inline int nr_flush_required_nowr(void)
+{
+ return 0;
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */

extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..a045f3b57c60 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -605,6 +605,32 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
}

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+void fold_ubc_nowr(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
+
+ if (!tlb_ubc_nowr->nr_flush_required)
+ return;
+
+ arch_tlbbatch_fold(&tlb_ubc->arch, &tlb_ubc_nowr->arch);
+ tlb_ubc->writable = tlb_ubc->writable || tlb_ubc_nowr->writable;
+ tlb_ubc->nr_flush_required += tlb_ubc_nowr->nr_flush_required;
+ tlb_ubc_nowr->nr_flush_required = 0;
+ tlb_ubc_nowr->writable = false;
+}
+
+int nr_flush_required(void)
+{
+ return current->tlb_ubc.nr_flush_required;
+}
+
+int nr_flush_required_nowr(void)
+{
+ return current->tlb_ubc_nowr.nr_flush_required;
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -615,11 +641,12 @@ void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

- if (!tlb_ubc->flush_required)
+ fold_ubc_nowr();
+ if (!tlb_ubc->nr_flush_required)
return;

arch_tlbbatch_flush(&tlb_ubc->arch);
- tlb_ubc->flush_required = false;
+ tlb_ubc->nr_flush_required = 0;
tlb_ubc->writable = false;
}

@@ -627,8 +654,9 @@ void try_to_unmap_flush(void)
void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;

- if (tlb_ubc->writable)
+ if (tlb_ubc->writable || tlb_ubc_nowr->writable)
try_to_unmap_flush();
}

@@ -645,15 +673,16 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr)
{
- struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc;
int batch;
bool writable = pte_dirty(pteval);

if (!pte_accessible(mm, pteval))
return;

+ tlb_ubc = pte_write(pteval) || writable ? &current->tlb_ubc : &current->tlb_ubc_nowr;
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
- tlb_ubc->flush_required = true;
+ tlb_ubc->nr_flush_required += 1;

/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
--
2.17.1

2023-10-30 07:41:52

by Byungchul Park

Subject: [v3 3/3] mm, migrc: Add a sysctl knob to enable/disable MIGRC mechanism

Add a sysctl knob, '/proc/sys/vm/migrc_enable' to switch on/off migrc.
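Writing 0 to /proc/sys/vm/migrc_enable turns migrc off at runtime and
writing 1 turns it back on; the knob defaults to enabled and changing
it requires CAP_SYS_ADMIN, as implemented below.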

Signed-off-by: Byungchul Park <[email protected]>
---
mm/migrate.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 76278eca8417..f6bbdc2d4fb1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -58,6 +58,48 @@
#include "internal.h"

#ifdef CONFIG_MIGRC
+static int sysctl_migrc_enable = 1;
+#ifdef CONFIG_SYSCTL
+static int sysctl_migrc_enable_handler(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ int err;
+ int enabled = sysctl_migrc_enable;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ t = *table;
+ t.data = &enabled;
+ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+ if (write)
+ sysctl_migrc_enable = enabled;
+ return err;
+}
+
+static struct ctl_table migrc_sysctls[] = {
+ {
+ .procname = "migrc_enable",
+ .data = NULL, /* filled in by handler */
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = sysctl_migrc_enable_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {}
+};
+
+static int __init migrc_sysctl_init(void)
+{
+ register_sysctl_init("vm", migrc_sysctls);
+ return 0;
+}
+late_initcall(migrc_sysctl_init);
+#endif

atomic_t migrc_gen;

@@ -242,7 +284,7 @@ static inline bool migrc_stopped_pending(void)

static bool can_migrc_system(void)
{
- return !migrc_stopped_pending();
+ return sysctl_migrc_enable && !migrc_stopped_pending();
}
#else
static inline void migrc_mark_folios(struct folio *fsrc, struct folio *fdst) {}
--
2.17.1

2023-10-30 07:42:09

by Byungchul Park

Subject: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

Implement CONFIG_MIGRC, which stands for 'Migration Read Copy'. While
working with tiered memory, e.g. CXL memory, we always face migration
overhead at promotion or demotion, and TLB shootdown turned out to be a
big part of it that is worth getting rid of where possible.

Fortunately, the TLB flush can be deferred or even skipped if both the
source and destination folios of a migration are kept around until all
the required TLB flushes have been done, provided the target PTE entries
have read-only permission, or more precisely, don't have write
permission. Otherwise, the folio could of course get corrupted.

To achieve that:

1. For folios that are mapped only by non-writable TLB entries, prevent
the TLB flush at migration time by keeping both the source and
destination folios, which will be handled later at a better time
(see the sketch of the generation scheme below).

2. When any non-writable TLB entry changes to writable, e.g. through
the fault handler, give up the CONFIG_MIGRC mechanism and perform
the required TLB flush right away.

3. Temporarily stop migrc from working when the system is under very
high memory pressure, e.g. when direct reclaim is needed.
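
As a rough illustration of how the deferred flush can later skip CPUs
that have flushed in the meantime anyway, consider the sketch below.
The names are made up; the real code is migrc_gen/migrc_done, before()
and arch_migrc_adj() in arch/x86/mm/tlb.c further down.

/*
 * Each migrc request records the global generation at unmap time, and
 * each CPU records the generation it had observed before its latest
 * local TLB flush.  A CPU whose recorded generation is not older than
 * the request's can be skipped at deferred-flush time.
 */
static bool cpu_needs_flush(int cpu_done_gen, int req_gen)
{
        /* Signed difference so that counter wrap-around is handled. */
        return cpu_done_gen - req_gen < 0;
}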

The measurement result:

Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

run 'perf stat' using events:
1) itlb.itlb_flush
2) tlb_flush.dtlb_thread
3) tlb_flush.stlb_any
4) dTLB-load-misses
5) dTLB-store-misses
6) iTLB-load-misses

run 'cat /proc/vmstat' and pick:
1) numa_pages_migrated
2) pgmigrate_success
3) nr_tlb_remote_flush
4) nr_tlb_remote_flush_received
5) nr_tlb_local_flush_all
6) nr_tlb_local_flush_one

BEFORE - mainline v6.6-rc5
------------------------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 50000000

Performance counter stats for 'system wide':

20953405 itlb.itlb_flush
114886593 tlb_flush.dtlb_thread
88267015 tlb_flush.stlb_any
115304095543 dTLB-load-misses
163904743 dTLB-store-misses
608486259 iTLB-load-misses

556.787113849 seconds time elapsed

$ cat /proc/vmstat

...
numa_pages_migrated 3378748
pgmigrate_success 7720310
nr_tlb_remote_flush 751464
nr_tlb_remote_flush_received 10742115
nr_tlb_local_flush_all 21899
nr_tlb_local_flush_one 740157
...

AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
------------------------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 50000000

Performance counter stats for 'system wide':

4353555 itlb.itlb_flush
72482780 tlb_flush.dtlb_thread
68226458 tlb_flush.stlb_any
114331610808 dTLB-load-misses
116084771 dTLB-store-misses
377180518 iTLB-load-misses

552.667718220 seconds time elapsed

$ cat /proc/vmstat

...
numa_pages_migrated 3339325
pgmigrate_success 7642363
nr_tlb_remote_flush 192913
nr_tlb_remote_flush_received 2327426
nr_tlb_local_flush_all 25759
nr_tlb_local_flush_one 740454
...

Signed-off-by: Byungchul Park <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 7 +
arch/x86/mm/tlb.c | 91 +++++++++++
include/linux/mm.h | 25 +++
include/linux/mm_types.h | 49 ++++++
include/linux/mmzone.h | 7 +
include/linux/page-flags.h | 7 +
include/linux/sched.h | 8 +
include/trace/events/mmflags.h | 9 +-
init/Kconfig | 14 ++
mm/memory.c | 13 ++
mm/migrate.c | 268 ++++++++++++++++++++++++++++++++
mm/page_alloc.c | 29 +++-
mm/rmap.c | 76 +++++++++
13 files changed, 601 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 9d1361d2119c..cc52d83d17ee 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -292,9 +292,16 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch);
extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
struct arch_tlbflush_unmap_batch *bsrc);

+#ifdef CONFIG_MIGRC
+extern void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen);
+#else
+static inline void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) {}
+#endif
+
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 314cd9912a88..0bdb617ffe94 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1219,9 +1219,80 @@ STATIC_NOPV void native_flush_tlb_local(void)
native_write_cr3(__native_read_cr3());
}

+#ifdef CONFIG_MIGRC
+DEFINE_PER_CPU(int, migrc_done);
+
+static inline bool before(int a, int b)
+{
+ return a - b < 0;
+}
+
+static inline int migrc_tlb_local_begin(void)
+{
+ int ret = atomic_read(&migrc_gen);
+
+ /*
+ * The following order should be guaranteed:
+ *
+ * 1. A migrc's request is queued at migration.
+ * 2. Each CPU performs __flush_tlb_local().
+ * 3. Marks this CPU has performed the TLB flush after 1.
+ *
+ * To achieve that, a kind of timestamp, migrc_gen was
+ * introduced to track the the order but conservatively, that
+ * is, it doesn't care whether TLB flush is performed more than
+ * it's needed but should guarantee TLB flush has been performed
+ * after the timestamp in migrc_done.
+ *
+ * 1 is done by migrc_unmap_done() and migrc_req_end().
+ * 2 is done by __flush_tlb_local().
+ * 3 is done by migrc_tlb_local_begin() and migrc_tlb_local_end().
+ *
+ * FYI, an example that should avoid is:
+ *
+ * 1. Performs __flush_tlb_local().
+ * 2. A migrc's request is queued at migration.
+ * 3. Marks this CPU has performed the TLB flush after 2(?).
+ *
+ * XXX: barrier() would be sufficient if the architecture
+ * already quarantees the order between memory access to
+ * migrc_gen and TLB flush.
+ */
+ smp_mb();
+
+ return ret;
+}
+
+static inline void migrc_tlb_local_end(int gen)
+{
+ /*
+ * XXX: barrier() would be sufficient if the architecture
+ * quarantees the order between TLB flush and memory access to
+ * migrc_done.
+ */
+ smp_mb();
+
+ if (before(READ_ONCE(*this_cpu_ptr(&migrc_done)), gen))
+ WRITE_ONCE(*this_cpu_ptr(&migrc_done), gen);
+}
+#else
+static inline int migrc_tlb_local_begin(void)
+{
+ return 0;
+}
+
+static inline void migrc_tlb_local_end(int gen)
+{
+}
+#endif
+
void flush_tlb_local(void)
{
+ unsigned int gen;
+
+ gen = migrc_tlb_local_begin();
__flush_tlb_local();
+ migrc_tlb_local_end(gen);
}

/*
@@ -1246,6 +1317,21 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);

+#ifdef CONFIG_MIGRC
+/*
+ * Skip CPUs that have already performed TLB flushes required, which is
+ * tracked by migrc_gen and migrc_done.
+ */
+void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen)
+{
+ int cpu;
+
+ for_each_cpu(cpu, &batch->cpumask)
+ if (!before(READ_ONCE(*per_cpu_ptr(&migrc_done, cpu)), gen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+}
+#endif
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
@@ -1274,6 +1360,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
put_cpu();
}

+void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+}
+
void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
struct arch_tlbflush_unmap_batch *bsrc)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..ddefe6033b66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4062,4 +4062,29 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end)

#endif

+#ifdef CONFIG_MIGRC
+void migrc_init_page(struct page *p);
+void migrc_unpend_folio(struct folio *f);
+bool migrc_pending_folio(struct folio *f);
+void migrc_shrink(struct llist_head *h);
+bool migrc_try_flush_free_folios(nodemask_t *nodes);
+void fold_ubc_nowr_to_migrc(void);
+void free_migrc_req(struct migrc_req *req);
+int migrc_pending_nr_in_zone(struct zone *z);
+void migrc_stop_pending(void);
+void migrc_resume_pending(void);
+
+extern atomic_t migrc_gen;
+#else
+static inline void migrc_init_page(struct page *p) {}
+static inline void migrc_unpend_folio(struct folio *f) {}
+static inline bool migrc_pending_folio(struct folio *f) { return false; }
+static inline void migrc_shrink(struct llist_head *h) {}
+static inline bool migrc_try_flush_free_folios(nodemask_t *nodes) { return false; }
+static inline void fold_ubc_nowr_to_migrc(void) {}
+static inline void free_migrc_req(struct migrc_req *req) {}
+static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
+static inline void migrc_stop_pending(void) {}
+static inline void migrc_resume_pending(void) {}
+#endif
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43999e6..55350f1d0db2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -101,6 +101,13 @@ struct page {
/* Or, free page */
struct list_head buddy_list;
struct list_head pcp_list;
+#ifdef CONFIG_MIGRC
+ /*
+ * Hang onto migrc's request queues,
+ * waiting for TLB flushes.
+ */
+ struct llist_node migrc_node;
+#endif
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
@@ -306,6 +313,13 @@ struct folio {
/* private: */
};
/* public: */
+#ifdef CONFIG_MIGRC
+ /*
+ * Hang onto migrc's request queues,
+ * waiting for TLB flushes.
+ */
+ struct llist_node migrc_node;
+#endif
};
struct address_space *mapping;
pgoff_t index;
@@ -1372,4 +1386,39 @@ enum {
/* See also internal only FOLL flags in mm/internal.h */
};

+#ifdef CONFIG_MIGRC
+struct migrc_req {
+ /*
+ * folios pending for TLB flush
+ */
+ struct llist_head folios;
+
+ /*
+ * the last folio in 'struct llist_head folios'
+ */
+ struct llist_node *last;
+
+ /*
+ * for hanging to migrc's request queue, migrc_reqs
+ */
+ struct llist_node llnode;
+
+ /*
+ * architecture specific batch data
+ */
+ struct arch_tlbflush_unmap_batch arch;
+
+ /*
+ * timestamp when hanging to migrc's request queue, migrc_reqs
+ */
+ int gen;
+
+ /*
+ * associated numa node
+ */
+ int nid;
+};
+#else
+struct migrc_req {};
+#endif
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..4e78d1c9ceb2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -980,6 +980,9 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+#ifdef CONFIG_MIGRC
+ atomic_t migrc_pending_nr;
+#endif
} ____cacheline_internodealigned_in_smp;

enum pgdat_flags {
@@ -1398,6 +1401,10 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+#ifdef CONFIG_MIGRC
+ atomic_t migrc_pending_nr;
+ struct llist_head migrc_reqs;
+#endif
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5c02720c53a5..1ca2ac91aa14 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,9 @@ enum pageflags {
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
PG_arch_2,
PG_arch_3,
+#endif
+#ifdef CONFIG_MIGRC
+ PG_migrc, /* Page has its copy under migrc's control */
#endif
__NR_PAGEFLAGS,

@@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif

+#ifdef CONFIG_MIGRC
+PAGEFLAG(Migrc, migrc, PF_ANY)
+#endif
+
/*
* PageReported() is used to track reported free pages within the Buddy
* allocator. We can use the non-atomic version of the test and set
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63189c023357..b7c402850b2f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1325,6 +1325,14 @@ struct task_struct {

struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_nowr;
+#ifdef CONFIG_MIGRC
+ /*
+ * Temporarily used while a migration is on processing before
+ * hanging to the associated numa node at the end of the
+ * migration.
+ */
+ struct migrc_req *mreq;
+#endif

/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..85b3b8859bc8 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -95,6 +95,12 @@
#define IF_HAVE_PG_ARCH_X(_name)
#endif

+#ifdef CONFIG_MIGRC
+#define IF_HAVE_PG_MIGRC(_name) ,{1UL << PG_##_name, __stringify(_name)}
+#else
+#define IF_HAVE_PG_MIGRC(_name)
+#endif
+
#define DEF_PAGEFLAG_NAME(_name) { 1UL << PG_##_name, __stringify(_name) }

#define __def_pageflag_names \
@@ -125,7 +131,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
IF_HAVE_PG_IDLE(young) \
IF_HAVE_PG_ARCH_X(arch_2) \
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3) \
+IF_HAVE_PG_MIGRC(migrc)

#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/init/Kconfig b/init/Kconfig
index 6d35728b94b2..4d88dab52059 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,20 @@ config NUMA_BALANCING_DEFAULT_ENABLED
If set, automatic NUMA balancing will be enabled if running on a NUMA
machine.

+config MIGRC
+ bool "Deferring TLB flush by keeping read copies on migration"
+ depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ depends on NUMA_BALANCING
+ depends on 64BIT
+ default n
+ help
+ TLB flush is necessary when PTE changes by migration. However,
+ TLB flush can be deferred if both copies of the src page and
+ the dst page are kept until TLB flush if they are non-writable.
+ System performance will be improved especially in case that
+ promotion and demotion types of migrations are heavily
+ happening.
+
menuconfig CGROUPS
bool "Control Group support"
select KERNFS
diff --git a/mm/memory.c b/mm/memory.c
index 6c264d2f969c..75dc48b6e15f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3359,6 +3359,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (vmf->page)
folio = page_folio(vmf->page);

+ /*
+ * This folio has its read copy to prevent inconsistency while
+ * deferring TLB flushes. However, the problem might arise if
+ * it's going to become writable.
+ *
+ * To prevent it, give up the deferring TLB flushes and perform
+ * TLB flush right away.
+ */
+ if (folio && migrc_pending_folio(folio)) {
+ migrc_unpend_folio(folio);
+ migrc_try_flush_free_folios(NULL);
+ }
+
/*
* Shared mapping: we are guaranteed to have VM_WRITE and
* FAULT_FLAG_WRITE set at this point.
diff --git a/mm/migrate.c b/mm/migrate.c
index 2053b54556ca..76278eca8417 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,204 @@

#include "internal.h"

+#ifdef CONFIG_MIGRC
+
+atomic_t migrc_gen;
+
+static struct migrc_req *alloc_migrc_req(void)
+{
+ return kmalloc(sizeof(struct migrc_req), GFP_KERNEL);
+}
+
+void free_migrc_req(struct migrc_req *req)
+{
+ kfree(req);
+}
+
+void migrc_init_page(struct page *p)
+{
+ ClearPageMigrc(p);
+}
+
+/*
+ * One of migrc's core functions that frees up all the folios in a
+ * migrc's pending queue.
+ *
+ * The list should be isolated before.
+ */
+void migrc_shrink(struct llist_head *h)
+{
+ struct folio *f, *f2;
+ struct llist_node *n;
+
+ n = llist_del_all(h);
+ llist_for_each_entry_safe(f, f2, n, migrc_node) {
+ struct pglist_data *node;
+ struct zone *zone;
+
+ node = NODE_DATA(folio_nid(f));
+ zone = page_zone(&f->page);
+ atomic_dec(&node->migrc_pending_nr);
+ atomic_dec(&zone->migrc_pending_nr);
+ folio_put(f);
+ }
+}
+
+void migrc_unpend_folio(struct folio *f)
+{
+ folio_clear_migrc(f);
+}
+
+bool migrc_pending_folio(struct folio *f)
+{
+ return folio_test_migrc(f);
+}
+
+/*
+ * Marks the folios as being under migrc's control.
+ */
+static void migrc_mark_folios(struct folio *fsrc, struct folio *fdst)
+{
+ if (!current->mreq)
+ return;
+
+ folio_get(fsrc);
+ folio_set_migrc(fsrc);
+ folio_set_migrc(fdst);
+ fold_ubc_nowr_to_migrc();
+}
+
+static bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst)
+{
+ /*
+ * XXX: The condition below is so tricky. Please suggest a
+ * better idea to determine if it's under migrc's control.
+ */
+ return migrc_pending_folio(fsrc) && migrc_pending_folio(fdst);
+}
+
+static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
+{
+ /*
+ * TLB flushes needed are already done at this moment so the
+ * flag doesn't have to be kept.
+ */
+ folio_clear_migrc(fsrc);
+ folio_clear_migrc(fdst);
+
+ folio_put(fsrc);
+}
+
+static void migrc_expand_req(struct folio *f)
+{
+ struct migrc_req *req = current->mreq;
+ struct pglist_data *node;
+ struct zone *zone;
+
+ if (!req)
+ return;
+
+ /*
+ * Focusing on numa migration, all the nids in a req are same.
+ */
+ if (req->nid == -1)
+ req->nid = folio_nid(f);
+ else if (WARN_ON(req->nid != folio_nid(f)))
+ return;
+
+ if (llist_add(&f->migrc_node, &req->folios))
+ req->last = &f->migrc_node;
+
+ node = NODE_DATA(folio_nid(f));
+ zone = page_zone(&f->page);
+ atomic_inc(&node->migrc_pending_nr);
+ atomic_inc(&zone->migrc_pending_nr);
+}
+
+static void migrc_unmap_done(void)
+{
+ if (current->mreq)
+ current->mreq->gen = atomic_inc_return(&migrc_gen);
+}
+
+/*
+ * Prepares for gathering folios pending for TLB flushes, try to
+ * allocate objects needed, initialize them and make them ready.
+ */
+static void migrc_req_start(void)
+{
+ struct migrc_req *req;
+
+ if (WARN_ON(current->mreq))
+ return;
+
+ req = alloc_migrc_req();
+ if (!req)
+ return;
+
+ arch_tlbbatch_clean(&req->arch);
+ init_llist_head(&req->folios);
+ req->last = NULL;
+ req->nid = -1;
+ current->mreq = req;
+}
+
+/*
+ * Hang the request with the collected folios to 'migrc_reqs' llist of
+ * the corresponding node.
+ */
+static void migrc_req_end_if_started(void)
+{
+ struct migrc_req *req = current->mreq;
+
+ if (!req)
+ return;
+
+ if (llist_empty(&req->folios))
+ free_migrc_req(req);
+ else
+ llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
+
+ current->mreq = NULL;
+}
+
+int migrc_pending_nr_in_zone(struct zone *z)
+{
+ return atomic_read(&z->migrc_pending_nr);
+}
+
+static atomic_t migrc_stop = ATOMIC_INIT(0);
+
+void migrc_stop_pending(void)
+{
+ atomic_inc(&migrc_stop);
+}
+
+void migrc_resume_pending(void)
+{
+ atomic_dec(&migrc_stop);
+}
+
+static inline bool migrc_stopped_pending(void)
+{
+ return !!atomic_read(&migrc_stop);
+}
+
+static bool can_migrc_system(void)
+{
+ return !migrc_stopped_pending();
+}
+#else
+static inline void migrc_mark_folios(struct folio *fsrc, struct folio *fdst) {}
+static inline bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst) { return false; }
+static inline void migrc_undo_folios(struct folio *fsrc, struct folio *fdst) {}
+static inline void migrc_expand_req(struct folio *f) {}
+static inline void migrc_unmap_done(void) {}
+static inline void migrc_req_start(void) {}
+static inline void migrc_req_end_if_started(void) {}
+static inline bool can_migrc_system(void) { return false; }
+#endif
+
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct folio *folio = folio_get_nontail_page(page);
@@ -379,6 +577,7 @@ static int folio_expected_refs(struct address_space *mapping,
struct folio *folio)
{
int refs = 1;
+
if (!mapping)
return refs;

@@ -406,6 +605,13 @@ int folio_migrate_mapping(struct address_space *mapping,
int expected_count = folio_expected_refs(mapping, folio) + extra_count;
long nr = folio_nr_pages(folio);

+ /*
+ * Migrc mechanism increases the reference count to defer
+ * freeing up folios until a better time.
+ */
+ if (migrc_marked_folios(folio, newfolio))
+ expected_count++;
+
if (!mapping) {
/* Anonymous page without mapping */
if (folio_ref_count(folio) != expected_count)
@@ -1620,16 +1826,30 @@ static int migrate_pages_batch(struct list_head *from,
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ bool migrc_cond1;

VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));

+ /*
+ * For now, apply migrc only to numa migration.
+ */
+ migrc_cond1 = can_migrc_system() &&
+ ((reason == MR_DEMOTION && current_is_kswapd()) ||
+ (reason == MR_NUMA_MISPLACED));
+
+ if (migrc_cond1)
+ migrc_req_start();
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
thp_retry = 0;
nr_retry_pages = 0;

list_for_each_entry_safe(folio, folio2, from, lru) {
+ int nr_required;
+ bool migrc_cond2;
+ bool migrc;
+
is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
nr_pages = folio_nr_pages(folio);

@@ -1657,9 +1877,34 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}

+ /*
+ * In case that the system is in high memory
+ * pressure, let's stop migrc from working and
+ * free up all the folios in the pending queues
+ * not to impact the system.
+ */
+ if (unlikely(migrc_cond1 && !can_migrc_system())) {
+ migrc_cond1 = false;
+ migrc_req_end_if_started();
+ migrc_try_flush_free_folios(NULL);
+ }
+
+ nr_required = nr_flush_required();
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
ret_folios);
+ /*
+ * migrc_cond2 will be true only if:
+ *
+ * 1. There's no writable mapping at all
+ * 2. There's read only mapping found
+ * 3. Not already under migrc's control
+ */
+ migrc_cond2 = nr_required == nr_flush_required() &&
+ nr_flush_required_nowr() &&
+ !migrc_pending_folio(folio);
+ migrc = migrc_cond1 && migrc_cond2;
+
/*
* The rules are:
* Success: folio will be freed
@@ -1720,6 +1965,9 @@ static int migrate_pages_batch(struct list_head *from,
case MIGRATEPAGE_UNMAP:
list_move_tail(&folio->lru, &unmap_folios);
list_add_tail(&dst->lru, &dst_folios);
+
+ if (migrc)
+ migrc_mark_folios(folio, dst);
break;
default:
/*
@@ -1733,12 +1981,24 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_failed_pages += nr_pages;
break;
}
+
+ /*
+ * Done with the current folio. Fold the ro
+ * batch data gathered, to the normal batch.
+ */
+ fold_ubc_nowr();
}
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /*
+ * Generate a new timestamp that will be used to filter out
+ * CPUs that have already performed TLB flushes needed.
+ */
+ migrc_unmap_done();
+
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();

@@ -1774,6 +2034,9 @@ static int migrate_pages_batch(struct list_head *from,
case MIGRATEPAGE_SUCCESS:
stats->nr_succeeded += nr_pages;
stats->nr_thp_succeeded += is_thp;
+
+ if (migrc_marked_folios(folio, dst))
+ migrc_expand_req(folio);
break;
default:
nr_failed++;
@@ -1791,6 +2054,8 @@ static int migrate_pages_batch(struct list_head *from,

rc = rc_saved ? : nr_failed;
out:
+ migrc_req_end_if_started();
+
/* Cleanup remaining folios */
dst = list_first_entry(&dst_folios, struct folio, lru);
dst2 = list_next_entry(dst, lru);
@@ -1798,6 +2063,9 @@ static int migrate_pages_batch(struct list_head *from,
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;

+ if (migrc_marked_folios(folio, dst))
+ migrc_undo_folios(folio, dst);
+
__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
true, ret_folios);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..2fa84c3a9681 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,

set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+
+ for (i = 0; i != 1 << order; ++i)
+ migrc_init_page(page + i);
}

static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
long min = mark;
int o;

+ free_pages += migrc_pending_nr_in_zone(z);
+
/* free_pages may go negative - that's OK */
free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);

@@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
long usable_free;
long reserved;

- usable_free = free_pages;
+ usable_free = free_pages + migrc_pending_nr_in_zone(z);
reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);

/* reserved may over estimate high-atomic reserves. */
@@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
gfp_mask)) {
int ret;

+ /*
+ * Free the pending folios so that the remaining
+ * code can use them. Check zone_watermark_fast()
+ * again with more accurate stats before going.
+ */
+ migrc_try_flush_free_folios(ac->nodemask);
+ if (zone_watermark_fast(zone, order, mark,
+ ac->highest_zoneidx, alloc_flags, gfp_mask))
+ goto try_this_zone;
+
if (has_unaccepted_memory()) {
if (try_to_accept_memory(zone, order))
goto try_this_zone;
@@ -3911,6 +3926,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned int cpuset_mems_cookie;
unsigned int zonelist_iter_cookie;
int reserve_flags;
+ bool migrc_stopped = false;

restart:
compaction_retries = 0;
@@ -4042,6 +4058,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;

+ /*
+ * The system is in very high memory pressure. Stop migrc from
+ * expanding its pending queue and working temporarily.
+ */
+ if (!migrc_stopped) {
+ migrc_stop_pending();
+ migrc_stopped = true;
+ }
+
/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;
@@ -4169,6 +4194,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
got_pg:
+ if (migrc_stopped)
+ migrc_resume_pending();
return page;
}

diff --git a/mm/rmap.c b/mm/rmap.c
index a045f3b57c60..d0fa7bf66e70 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -606,6 +606,82 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH

+#ifdef CONFIG_MIGRC
+/*
+ * Gather folios and CPUs in cpumask to handle.
+ */
+static void migrc_gather(struct llist_head *folios,
+ struct arch_tlbflush_unmap_batch *arch,
+ struct llist_head *reqs)
+{
+ struct llist_node *reqs_nodes;
+ struct migrc_req *req;
+ struct migrc_req *req2;
+
+ reqs_nodes = llist_del_all(reqs);
+ if (!reqs_nodes)
+ return;
+
+ /*
+ * TODO: Optimize the time complexity.
+ */
+ llist_for_each_entry_safe(req, req2, reqs_nodes, llnode) {
+ struct llist_node *n;
+
+ arch_migrc_adj(&req->arch, req->gen);
+ arch_tlbbatch_fold(arch, &req->arch);
+
+ n = llist_del_all(&req->folios);
+ llist_add_batch(n, req->last, folios);
+ free_migrc_req(req);
+ }
+}
+
+bool migrc_try_flush_free_folios(nodemask_t *nodes)
+{
+ struct arch_tlbflush_unmap_batch arch;
+ LLIST_HEAD(folios);
+ bool ret = false;
+ int nid;
+
+ arch_tlbbatch_clean(&arch);
+ nodes = nodes ?: &node_possible_map;
+ for_each_node_mask(nid, *nodes)
+ migrc_gather(&folios, &arch, &NODE_DATA(nid)->migrc_reqs);
+
+ ret = !llist_empty(&folios);
+ if (ret) {
+ arch_tlbbatch_flush(&arch);
+ migrc_shrink(&folios);
+ }
+ return ret;
+}
+
+/*
+ * Fold ro batch data to the current migrc's request. In case that the
+ * request is not ready, fold it to the normal batch as a fallback.
+ */
+void fold_ubc_nowr_to_migrc(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
+ struct migrc_req *req;
+
+ if (!tlb_ubc_nowr->nr_flush_required)
+ return;
+
+ req = current->mreq;
+ if (!req) {
+ /* fallback */
+ fold_ubc_nowr();
+ return;
+ }
+
+ arch_tlbbatch_fold(&req->arch, &tlb_ubc_nowr->arch);
+ tlb_ubc_nowr->nr_flush_required = 0;
+ tlb_ubc_nowr->writable = false;
+}
+#endif
+
void fold_ubc_nowr(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
--
2.17.1

2023-10-30 07:52:27

by Nadav Amit

Subject: Re: [v3 1/3] mm/rmap: Recognize non-writable TLB entries during TLB batch flush


Below are some points you might find useful:

> +
> /*
> * Blindly accessing user memory from NMI context can be dangerous
> * if we're in the middle of switching the current user task or
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index aa44fff8bb9d..35ba9425d48d 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -59,8 +59,8 @@ struct tlbflush_unmap_batch {
> */
> struct arch_tlbflush_unmap_batch arch;
>
> - /* True if a flush is needed. */
> - bool flush_required;
> + /* The number of flush requested. */

Number of what? Base pages I presume.

> + int nr_flush_required;

Perhaps unsigned would be better suited?

>
> /*
> * If true then the PTE was dirty when unmapped. The entry must be
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 77f01ac385f7..63189c023357 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1324,6 +1324,7 @@ struct task_struct {
> #endif
>
> struct tlbflush_unmap_batch tlb_ubc;
> + struct tlbflush_unmap_batch tlb_ubc_nowr;

tlb_ubc_nowr is, I think, less informative than tlb_ubc_ro (and a comment
would be useful).

[snip]

>
> +
> +int nr_flush_required(void)
> +{
> + return current->tlb_ubc.nr_flush_required;
> +}
> +
> +int nr_flush_required_nowr(void)
> +{
> + return current->tlb_ubc_nowr.nr_flush_required;
> +}

I haven’t gone through the users of these functions yet, as they are not included
in this patch (which is usually not great).

Anyhow, it might be a bit wasteful to have a function call for such a trivial
accessor. See if it is possible to avoid that call.

> +
> /*
> * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> * important if a PTE was dirty when it was unmapped that it's flushed
> @@ -615,11 +641,12 @@ void try_to_unmap_flush(void)
> {
> struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>
> - if (!tlb_ubc->flush_required)
> + fold_ubc_nowr();
> + if (!tlb_ubc->nr_flush_required)
> return;
>
> arch_tlbbatch_flush(&tlb_ubc->arch);
> - tlb_ubc->flush_required = false;
> + tlb_ubc->nr_flush_required = 0;
> tlb_ubc->writable = false;
> }
>
> @@ -627,8 +654,9 @@ void try_to_unmap_flush(void)
> void try_to_unmap_flush_dirty(void)
> {
> struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> + struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
>
> - if (tlb_ubc->writable)
> + if (tlb_ubc->writable || tlb_ubc_nowr->writable)
> try_to_unmap_flush();
> }
>
> @@ -645,15 +673,16 @@ void try_to_unmap_flush_dirty(void)
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> unsigned long uaddr)
> {
> - struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> + struct tlbflush_unmap_batch *tlb_ubc;
> int batch;
> bool writable = pte_dirty(pteval);
>
> if (!pte_accessible(mm, pteval))
> return;
>
> + tlb_ubc = pte_write(pteval) || writable ? &current->tlb_ubc : &current->tlb_ubc_nowr;

Using the ternary operator here is a bit confusing. You can use an “if”
instead, or if your mind is set on doing it this way, at least make it
easier to read:

tlb_ubc = (pte_write(pteval) || writable) ? &current->tlb_ubc :
                                            &current->tlb_ubc_nowr;
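
Or, spelled out with an “if” (just for illustration):

if (pte_write(pteval) || writable)
        tlb_ubc = &current->tlb_ubc;
else
        tlb_ubc = &current->tlb_ubc_nowr;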

And of course, add a comment.

> arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
> - tlb_ubc->flush_required = true;
> + tlb_ubc->nr_flush_required += 1;

Presumably overflow is impossible for other reasons, but something like that
worries me.

2023-10-30 08:02:06

by David Hildenbrand

Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

On 30.10.23 08:25, Byungchul Park wrote:
> Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
> We always face the migration overhead at either promotion or demotion,
> while working with tiered memory e.g. CXL memory and found out TLB
> shootdown is a quite big one that is needed to get rid of if possible.
>
> Fortunately, TLB flush can be defered or even skipped if both source and
> destination of folios during migration are kept until all TLB flushes
> required will have been done, of course, only if the target PTE entries
> have read only permission, more precisely speaking, don't have write
> permission. Otherwise, no doubt the folio might get messed up.
>
> To achieve that:
>
> 1. For the folios that map only to non-writable TLB entries, prevent
> TLB flush at migration by keeping both source and destination
> folios, which will be handled later at a better time.
>
> 2. When any non-writable TLB entry changes to writable e.g. through
> fault handler, give up CONFIG_MIGRC mechanism so as to perform
> TLB flush required right away.
>
> 3. Temporarily stop migrc from working when the system is in very
> high memory pressure e.g. direct reclaim needed.
>
> The measurement result:
>
> Architecture - x86_64
> QEMU - kvm enabled, host cpu
> Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
> Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
>
> run 'perf stat' using events:
> 1) itlb.itlb_flush
> 2) tlb_flush.dtlb_thread
> 3) tlb_flush.stlb_any
> 4) dTLB-load-misses
> 5) dTLB-store-misses
> 6) iTLB-load-misses
>
> run 'cat /proc/vmstat' and pick:
> 1) numa_pages_migrated
> 2) pgmigrate_success
> 3) nr_tlb_remote_flush
> 4) nr_tlb_remote_flush_received
> 5) nr_tlb_local_flush_all
> 6) nr_tlb_local_flush_one
>
> BEFORE - mainline v6.6-rc5
> ------------------------------------------
> $ perf stat -a \
> -e itlb.itlb_flush \
> -e tlb_flush.dtlb_thread \
> -e tlb_flush.stlb_any \
> -e dTLB-load-misses \
> -e dTLB-store-misses \
> -e iTLB-load-misses \
> ./XSBench -p 50000000
>
> Performance counter stats for 'system wide':
>
> 20953405 itlb.itlb_flush
> 114886593 tlb_flush.dtlb_thread
> 88267015 tlb_flush.stlb_any
> 115304095543 dTLB-load-misses
> 163904743 dTLB-store-misses
> 608486259 iTLB-load-misses
>
> 556.787113849 seconds time elapsed
>
> $ cat /proc/vmstat
>
> ...
> numa_pages_migrated 3378748
> pgmigrate_success 7720310
> nr_tlb_remote_flush 751464
> nr_tlb_remote_flush_received 10742115
> nr_tlb_local_flush_all 21899
> nr_tlb_local_flush_one 740157
> ...
>
> AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> ------------------------------------------
> $ perf stat -a \
> -e itlb.itlb_flush \
> -e tlb_flush.dtlb_thread \
> -e tlb_flush.stlb_any \
> -e dTLB-load-misses \
> -e dTLB-store-misses \
> -e iTLB-load-misses \
> ./XSBench -p 50000000
>
> Performance counter stats for 'system wide':
>
> 4353555 itlb.itlb_flush
> 72482780 tlb_flush.dtlb_thread
> 68226458 tlb_flush.stlb_any
> 114331610808 dTLB-load-misses
> 116084771 dTLB-store-misses
> 377180518 iTLB-load-misses
>
> 552.667718220 seconds time elapsed
>
> $ cat /proc/vmstat
>

So, an improvement of 0.74%? How stable are the results? Serious
question: worth the churn?
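
(For reference, (556.787 - 552.667) / 556.787 is roughly 0.74% of the
elapsed time.)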

Or did I get the numbers wrong?

> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5c02720c53a5..1ca2ac91aa14 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,9 @@ enum pageflags {
> #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> PG_arch_2,
> PG_arch_3,
> +#endif
> +#ifdef CONFIG_MIGRC
> + PG_migrc, /* Page has its copy under migrc's control */
> #endif
> __NR_PAGEFLAGS,
>
> @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> PAGEFLAG(Idle, idle, PF_ANY)
> #endif
>
> +#ifdef CONFIG_MIGRC
> +PAGEFLAG(Migrc, migrc, PF_ANY)
> +#endif

I assume you know this: new pageflags are frowned upon.

--
Cheers,

David / dhildenb

2023-10-30 08:51:04

by Nadav Amit

Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

Hi Byungchul,

A few more comments. I think I lost concentration towards the end.

This patch should be broken into multiple patches, I think, and have more
comments.

I didn’t thoroughly review the entire “stopping” mechanism and other parts
of the code.


> On Oct 30, 2023, at 9:25 AM, Byungchul Park <[email protected]> wrote:
>
>
>
> +#ifdef CONFIG_MIGRC

Do you really need a new config option?

I think you rely on the UBC stuff and that should be the only config
option you rely on.

> +extern void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen);
> +#else
> +static inline void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) {}
> +#endif
> +
> static inline bool pte_flags_need_flush(unsigned long oldflags,
> unsigned long newflags,
> bool ignore_access)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 314cd9912a88..0bdb617ffe94 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1219,9 +1219,80 @@ STATIC_NOPV void native_flush_tlb_local(void)
> native_write_cr3(__native_read_cr3());
> }
>
> +#ifdef CONFIG_MIGRC
> +DEFINE_PER_CPU(int, migrc_done);

Any concern about overflow?

Following what I read later in the patch, you consider migrc_done as shared.
If that’s the case you should have used DECLARE_PER_CPU_SHARED_ALIGNED.

But I am not sure it should be shared, because that’s is used for an
optimization. See my comment below regarding this matter.

> +
> +static inline bool before(int a, int b)
> +{
> + return a - b < 0;
> +}

I think you should be able to manage without this function.

> +
> +static inline int migrc_tlb_local_begin(void)

Usually, it is better to leave it to the compiler to make inline decisions.
IOW, if there is no really good reason, I wouldn't use “inline” in “.c” files.

> +{
> + int ret = atomic_read(&migrc_gen);
> +
> + /*
> + * The following order should be guaranteed:
> + *
> + * 1. A migrc's request is queued at migration.
> + * 2. Each CPU performs __flush_tlb_local().
> + * 3. Marks this CPU has performed the TLB flush after 1.

(3) is not clear to me, specifically what “after 1” means.

> + *
> + * To achieve that, a kind of timestamp, migrc_gen was
> + * introduced to track the the order but conservatively, that
> + * is, it doesn't care whether TLB flush is performed more than
> + * it's needed but should guarantee TLB flush has been performed
> + * after the timestamp in migrc_done.
> + *
> + * 1 is done by migrc_unmap_done() and migrc_req_end().
> + * 2 is done by __flush_tlb_local().
> + * 3 is done by migrc_tlb_local_begin() and migrc_tlb_local_end().
> + *
> + * FYI, an example that should avoid is:

FYI?

> + *
> + * 1. Performs __flush_tlb_local().
> + * 2. A migrc's request is queued at migration.
> + * 3. Marks this CPU has performed the TLB flush after 2(?).

I think this example is not needed. But the comment can be improved.

> + *
> + * XXX: barrier() would be sufficient if the architecture
> + * already quarantees the order between memory access to
> + * migrc_gen and TLB flush.
> + */

Since UBC is used by two architectures IIRC, is it needed or not?
Having an unnecessary smp_mb() is not great.

On x86 the memory write and the TLB flush are ordered. I think you are
also considering the remote TLB flush, and those should also be ordered
(IIUC).

> + smp_mb();
> +
> + return ret;
> +}
> +
> +static inline void migrc_tlb_local_end(int gen)
> +{
> + /*
> + * XXX: barrier() would be sufficient if the architecture
> + * quarantees the order between TLB flush and memory access to
> + * migrc_done.
> + */
> + smp_mb();
> +
> + if (before(READ_ONCE(*this_cpu_ptr(&migrc_done)), gen))

I have no idea why READ_ONCE() is needed here, and the fact that you think
it is needed is a bit concerning.

> + WRITE_ONCE(*this_cpu_ptr(&migrc_done), gen);
> +}
> +#else
> +static inline int migrc_tlb_local_begin(void)
> +{
> + return 0;
> +}
> +
> +static inline void migrc_tlb_local_end(int gen)
> +{
> +}
> +#endif
> +
> void flush_tlb_local(void)
> {
> + unsigned int gen;
> +
> + gen = migrc_tlb_local_begin();
> __flush_tlb_local();
> + migrc_tlb_local_end(gen);
> }
>
> /*
> @@ -1246,6 +1317,21 @@ void __flush_tlb_all(void)
> }
> EXPORT_SYMBOL_GPL(__flush_tlb_all);
>
> +#ifdef CONFIG_MIGRC
> +/*
> + * Skip CPUs that have already performed TLB flushes required, which is
> + * tracked by migrc_gen and migrc_done.
> + */
> +void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, &batch->cpumask)
> + if (!before(READ_ONCE(*per_cpu_ptr(&migrc_done, cpu)), gen))
> + cpumask_clear_cpu(cpu, &batch->cpumask);

This is an optimization, and I think it should have gone into a different patch
with a separate performance benefit evaluation. Accessing remote CPUs' caches to
decide whether to flush the remote TLB is a performance tradeoff, so this change
might not always be a win.

> +}
> +#endif
> +
> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> {
> struct flush_tlb_info *info;
> @@ -1274,6 +1360,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> put_cpu();
> }
>
> +void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch)

arch_tlbbatch_clear() perhaps? Or maybe an even better name that describes what
it does at a higher level (i.e., when it should be used)?

> +{
> + cpumask_clear(&batch->cpumask);
> +}
> +
> void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
> struct arch_tlbflush_unmap_batch *bsrc)
> {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bf5d0b1b16f4..ddefe6033b66 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4062,4 +4062,29 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end)
>
> #endif
>
> +#ifdef CONFIG_MIGRC
> +void migrc_init_page(struct page *p);
> +void migrc_unpend_folio(struct folio *f);
> +bool migrc_pending_folio(struct folio *f);
> +void migrc_shrink(struct llist_head *h);
> +bool migrc_try_flush_free_folios(nodemask_t *nodes);
> +void fold_ubc_nowr_to_migrc(void);
> +void free_migrc_req(struct migrc_req *req);
> +int migrc_pending_nr_in_zone(struct zone *z);
> +void migrc_stop_pending(void);
> +void migrc_resume_pending(void);

How about some comments to describe when each of those should be used?

And do they need to be in mm.h or can they be in mm/internal.h?

> +
> +extern atomic_t migrc_gen;
> +#else
> +static inline void migrc_init_page(struct page *p) {}
> +static inline void migrc_unpend_folio(struct folio *f) {}
> +static inline bool migrc_pending_folio(struct folio *f) { return false; }
> +static inline void migrc_shrink(struct llist_head *h) {}
> +static inline bool migrc_try_flush_free_folios(nodemask_t *nodes) { return false; }
> +static inline void fold_ubc_nowr_to_migrc(void) {}
> +static inline void free_migrc_req(struct migrc_req *req) {}
> +static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
> +static inline void migrc_stop_pending(void) {}
> +static inline void migrc_resume_pending(void) {}
> +#endif
> #endif /* _LINUX_MM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 36c5b43999e6..55350f1d0db2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -101,6 +101,13 @@ struct page {
> /* Or, free page */
> struct list_head buddy_list;
> struct list_head pcp_list;
> +#ifdef CONFIG_MIGRC
> + /*
> + * Hang onto migrc's request queues,
> + * waiting for TLB flushes.
> + */
> + struct llist_node migrc_node;
> +#endif

Any reason for the use of llist?

> };
> /* See page-flags.h for PAGE_MAPPING_FLAGS */
> struct address_space *mapping;
> @@ -306,6 +313,13 @@ struct folio {
> /* private: */
> };
> /* public: */
> +#ifdef CONFIG_MIGRC
> + /*
> + * Hang onto migrc's request queues,
> + * waiting for TLB flushes.
> + */
> + struct llist_node migrc_node;
> +#endif
> };
> struct address_space *mapping;
> pgoff_t index;
> @@ -1372,4 +1386,39 @@ enum {
> /* See also internal only FOLL flags in mm/internal.h */
> };
>
> +#ifdef CONFIG_MIGRC
> +struct migrc_req {
> + /*
> + * folios pending for TLB flush
> + */
> + struct llist_head folios;
> +
> + /*
> + * the last folio in 'struct llist_head folios'
> + */
> + struct llist_node *last;

Sounds like some unconventional synchronization scheme is going to come
later. Sounds like trouble...

> +
> + /*
> + * for hanging to migrc's request queue, migrc_reqs
> + */
> + struct llist_node llnode;
> +
> + /*
> + * architecture specific batch data
> + */
> + struct arch_tlbflush_unmap_batch arch;
> +
> + /*
> + * timestamp when hanging to migrc's request queue, migrc_reqs
> + */
> + int gen;
> +
> + /*
> + * associated numa node
> + */
> + int nid;
> +};
> +#else
> +struct migrc_req {};
> +#endif
> #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..4e78d1c9ceb2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -980,6 +980,9 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
> +#ifdef CONFIG_MIGRC
> + atomic_t migrc_pending_nr;
> +#endif
> } ____cacheline_internodealigned_in_smp;
>
> enum pgdat_flags {
> @@ -1398,6 +1401,10 @@ typedef struct pglist_data {
> #ifdef CONFIG_MEMORY_FAILURE
> struct memory_failure_stats mf_stats;
> #endif
> +#ifdef CONFIG_MIGRC
> + atomic_t migrc_pending_nr;
> + struct llist_head migrc_reqs;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5c02720c53a5..1ca2ac91aa14 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,9 @@ enum pageflags {
> #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> PG_arch_2,
> PG_arch_3,
> +#endif
> +#ifdef CONFIG_MIGRC
> + PG_migrc, /* Page has its copy under migrc's control */

Well, you know that’s not great.

> #endif
> __NR_PAGEFLAGS,
>
> @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> PAGEFLAG(Idle, idle, PF_ANY)
> #endif
>
> +#ifdef CONFIG_MIGRC
> +PAGEFLAG(Migrc, migrc, PF_ANY)
> +#endif
> +
> /*
> * PageReported() is used to track reported free pages within the Buddy
> * allocator. We can use the non-atomic version of the test and set
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63189c023357..b7c402850b2f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1325,6 +1325,14 @@ struct task_struct {
>
> struct tlbflush_unmap_batch tlb_ubc;
> struct tlbflush_unmap_batch tlb_ubc_nowr;
> +#ifdef CONFIG_MIGRC
> + /*
> + * Temporarily used while a migration is on processing before
> + * hanging to the associated numa node at the end of the
> + * migration.
> + */
> + struct migrc_req *mreq;

Why is this indirection needed?

> +#endif
>
> /* Cache last used pipe for splice(): */
> struct pipe_inode_info *splice_pipe;
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 1478b9dd05fa..85b3b8859bc8 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -95,6 +95,12 @@
> #define IF_HAVE_PG_ARCH_X(_name)
> #endif
>
> +#ifdef CONFIG_MIGRC
> +#define IF_HAVE_PG_MIGRC(_name) ,{1UL << PG_##_name, __stringify(_name)}
> +#else
> +#define IF_HAVE_PG_MIGRC(_name)
> +#endif
> +
> #define DEF_PAGEFLAG_NAME(_name) { 1UL << PG_##_name, __stringify(_name) }
>
> #define __def_pageflag_names \
> @@ -125,7 +131,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
> IF_HAVE_PG_IDLE(idle) \
> IF_HAVE_PG_IDLE(young) \
> IF_HAVE_PG_ARCH_X(arch_2) \
> -IF_HAVE_PG_ARCH_X(arch_3)
> +IF_HAVE_PG_ARCH_X(arch_3) \
> +IF_HAVE_PG_MIGRC(migrc)
>
> #define show_page_flags(flags) \
> (flags) ? __print_flags(flags, "|", \
> diff --git a/init/Kconfig b/init/Kconfig
> index 6d35728b94b2..4d88dab52059 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,6 +908,20 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> If set, automatic NUMA balancing will be enabled if running on a NUMA
> machine.
>
> +config MIGRC
> + bool "Deferring TLB flush by keeping read copies on migration"
> + depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> + depends on NUMA_BALANCING
> + depends on 64BIT
> + default n
> + help
> + TLB flush is necessary when PTE changes by migration. However,
> + TLB flush can be deferred if both copies of the src page and
> + the dst page are kept until TLB flush if they are non-writable.
> + System performance will be improved especially in case that
> + promotion and demotion types of migrations are heavily
> + happening.

I don’t think users should have any control over this feature.

> +
> menuconfig CGROUPS
> bool "Control Group support"
> select KERNFS
> diff --git a/mm/memory.c b/mm/memory.c
> index 6c264d2f969c..75dc48b6e15f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3359,6 +3359,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (vmf->page)
> folio = page_folio(vmf->page);
>
> + /*
> + * This folio has its read copy to prevent inconsistency while
> + * deferring TLB flushes. However, the problem might arise if
> + * it's going to become writable.
> + *
> + * To prevent it, give up the deferring TLB flushes and perform
> + * TLB flush right away.
> + */
> + if (folio && migrc_pending_folio(folio)) {
> + migrc_unpend_folio(folio);
> + migrc_try_flush_free_folios(NULL);

So many potential function calls… Probably they should have been combined
into one and at least migrc_pending_folio() should have been an inline
function in the header.

> + }
> +

What about mprotect? I thought David changed it so it can set writable
PTEs.

> /*
> * Shared mapping: we are guaranteed to have VM_WRITE and
> * FAULT_FLAG_WRITE set at this point.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2053b54556ca..76278eca8417 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -57,6 +57,204 @@
>
> #include "internal.h"
>
> +#ifdef CONFIG_MIGRC
> +
> +atomic_t migrc_gen;

This would overflow.

> +
> +static struct migrc_req *alloc_migrc_req(void)
> +{
> + return kmalloc(sizeof(struct migrc_req), GFP_KERNEL);
> +}
> +
> +void free_migrc_req(struct migrc_req *req)
> +{
> + kfree(req);
> +}
> +
> +void migrc_init_page(struct page *p)
> +{
> + ClearPageMigrc(p);
> +}

So many wrappers for short functions. A bit of unnecessary overhead.

> +
> +/*
> + * One of migrc's core functions that frees up all the folios in a
> + * migrc's pending queue.
> + *
> + * The list should be isolated before.
> + */
> +void migrc_shrink(struct llist_head *h)

The name doesn’t tell me what it does. Seeing the word “shrink”, I originally
thought it was a memory shrinker that frees memory when too many pages get
their TLB flush deferred.

I don’t think it should be exposed for someone to call it without actual TLB
flush.

> +{
> + struct folio *f, *f2;
> + struct llist_node *n;
> +
> + n = llist_del_all(h);
> + llist_for_each_entry_safe(f, f2, n, migrc_node) {
> + struct pglist_data *node;
> + struct zone *zone;
> +
> + node = NODE_DATA(folio_nid(f));
> + zone = page_zone(&f->page);
> + atomic_dec(&node->migrc_pending_nr);
> + atomic_dec(&zone->migrc_pending_nr);

Why do you need both?

> + folio_put(f);
> + }
> +}
> +
> +void migrc_unpend_folio(struct folio *f)
> +{
> + folio_clear_migrc(f);
> +}
> +
> +bool migrc_pending_folio(struct folio *f)
> +{
> + return folio_test_migrc(f);
> +}
> +
> +/*
> + * Marks the folios as being under migrc's control.
> + */
> +static void migrc_mark_folios(struct folio *fsrc, struct folio *fdst)
> +{
> + if (!current->mreq)
> + return;
> +
> + folio_get(fsrc);
> + folio_set_migrc(fsrc);
> + folio_set_migrc(fdst);
> + fold_ubc_nowr_to_migrc();
> +}
> +
> +static bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst)
> +{
> + /*
> + * XXX: The condition below is so tricky. Please suggest a
> + * better idea to determine if it's under migrc's control.
> + */
> + return migrc_pending_folio(fsrc) && migrc_pending_folio(fdst);

Tricky == racy?

I do not follow the ordering that ensures a migrated page would not
become writable before the migration is over. Perhaps a comment on what
guarantees the order would help.

> +}
> +
> +static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
> +{
> + /*
> + * TLB flushes needed are already done at this moment so the
> + * flag doesn't have to be kept.
> + */
> + folio_clear_migrc(fsrc);
> + folio_clear_migrc(fdst);
> +
> + folio_put(fsrc);
> +}
> +
> +static void migrc_expand_req(struct folio *f)
> +{
> + struct migrc_req *req = current->mreq;
> + struct pglist_data *node;
> + struct zone *zone;
> +
> + if (!req)
> + return;
> +
> + /*
> + * Focusing on numa migration, all the nids in a req are same.
> + */
> + if (req->nid == -1)
> + req->nid = folio_nid(f);
> + else if (WARN_ON(req->nid != folio_nid(f)))
> + return;
> +
> + if (llist_add(&f->migrc_node, &req->folios))
> + req->last = &f->migrc_node;
> +
> + node = NODE_DATA(folio_nid(f));
> + zone = page_zone(&f->page);
> + atomic_inc(&node->migrc_pending_nr);
> + atomic_inc(&zone->migrc_pending_nr);
> +}
> +
> +static void migrc_unmap_done(void)
> +{
> + if (current->mreq)
> + current->mreq->gen = atomic_inc_return(&migrc_gen);
> +}
> +
> +/*
> + * Prepares for gathering folios pending for TLB flushes, try to
> + * allocate objects needed, initialize them and make them ready.
> + */
> +static void migrc_req_start(void)
> +{
> + struct migrc_req *req;
> +
> + if (WARN_ON(current->mreq))
> + return;
> +
> + req = alloc_migrc_req();
> + if (!req)
> + return;
> +
> + arch_tlbbatch_clean(&req->arch);
> + init_llist_head(&req->folios);
> + req->last = NULL;
> + req->nid = -1;
> + current->mreq = req;
> +}
> +
> +/*
> + * Hang the request with the collected folios to 'migrc_reqs' llist of
> + * the corresponding node.
> + */
> +static void migrc_req_end_if_started(void)
> +{
> + struct migrc_req *req = current->mreq;
> +
> + if (!req)
> + return;
> +
> + if (llist_empty(&req->folios))
> + free_migrc_req(req);
> + else
> + llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
> +
> + current->mreq = NULL;
> +}
> +
> +int migrc_pending_nr_in_zone(struct zone *z)
> +{
> + return atomic_read(&z->migrc_pending_nr);
> +}
> +
> +static atomic_t migrc_stop = ATOMIC_INIT(0);

Comments?

> +
> +void migrc_stop_pending(void)
> +{
> + atomic_inc(&migrc_stop);
> +}
> +
> +void migrc_resume_pending(void)
> +{
> + atomic_dec(&migrc_stop);
> +}
> +
> +static inline bool migrc_stopped_pending(void)
> +{
> + return !!atomic_read(&migrc_stop);
> +}
> +
> +static bool can_migrc_system(void)
> +{
> + return !migrc_stopped_pending();
> +}
> +#else
> +static inline void migrc_mark_folios(struct folio *fsrc, struct folio *fdst) {}
> +static inline bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst) { return false; }
> +static inline void migrc_undo_folios(struct folio *fsrc, struct folio *fdst) {}
> +static inline void migrc_expand_req(struct folio *f) {}
> +static inline void migrc_unmap_done(void) {}
> +static inline void migrc_req_start(void) {}
> +static inline void migrc_req_end_if_started(void) {}
> +static inline bool can_migrc_system(void) { return false; }
> +#endif
> +
> bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> struct folio *folio = folio_get_nontail_page(page);
> @@ -379,6 +577,7 @@ static int folio_expected_refs(struct address_space *mapping,
> struct folio *folio)
> {
> int refs = 1;
> +
> if (!mapping)
> return refs;
>
> @@ -406,6 +605,13 @@ int folio_migrate_mapping(struct address_space *mapping,
> int expected_count = folio_expected_refs(mapping, folio) + extra_count;
> long nr = folio_nr_pages(folio);
>
> + /*
> + * Migrc mechanism increases the reference count to defer
> + * freeing up folios until a better time.
> + */
> + if (migrc_marked_folios(folio, newfolio))
> + expected_count++;
> +
> if (!mapping) {
> /* Anonymous page without mapping */
> if (folio_ref_count(folio) != expected_count)
> @@ -1620,16 +1826,30 @@ static int migrate_pages_batch(struct list_head *from,
> LIST_HEAD(unmap_folios);
> LIST_HEAD(dst_folios);
> bool nosplit = (reason == MR_NUMA_MISPLACED);
> + bool migrc_cond1;

Try to find better names.

>
> VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
> !list_empty(from) && !list_is_singular(from));
>
> + /*
> + * For now, apply migrc only to numa migration.
> + */
> + migrc_cond1 = can_migrc_system() &&
> + ((reason == MR_DEMOTION && current_is_kswapd()) ||
> + (reason == MR_NUMA_MISPLACED));
> +
> + if (migrc_cond1)
> + migrc_req_start();
> for (pass = 0; pass < nr_pass && retry; pass++) {
> retry = 0;
> thp_retry = 0;
> nr_retry_pages = 0;
>
> list_for_each_entry_safe(folio, folio2, from, lru) {
> + int nr_required;
> + bool migrc_cond2;
> + bool migrc;
> +
> is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
> nr_pages = folio_nr_pages(folio);
>
> @@ -1657,9 +1877,34 @@ static int migrate_pages_batch(struct list_head *from,
> continue;
> }
>
> + /*
> + * In case that the system is in high memory
> + * pressure, let's stop migrc from working and
> + * free up all the folios in the pending queues
> + * not to impact the system.
> + */
> + if (unlikely(migrc_cond1 && !can_migrc_system())) {
> + migrc_cond1 = false;
> + migrc_req_end_if_started();
> + migrc_try_flush_free_folios(NULL);
> + }
> +
> + nr_required = nr_flush_required();
> rc = migrate_folio_unmap(get_new_folio, put_new_folio,
> private, folio, &dst, mode, reason,
> ret_folios);
> + /*
> + * migrc_cond2 will be true only if:
> + *
> + * 1. There's no writable mapping at all
> + * 2. There's read only mapping found
> + * 3. Not already under migrc's control
> + */
> + migrc_cond2 = nr_required == nr_flush_required() &&
> + nr_flush_required_nowr() &&
> + !migrc_pending_folio(folio);
> + migrc = migrc_cond1 && migrc_cond2;
> +
> /*
> * The rules are:
> * Success: folio will be freed
> @@ -1720,6 +1965,9 @@ static int migrate_pages_batch(struct list_head *from,
> case MIGRATEPAGE_UNMAP:
> list_move_tail(&folio->lru, &unmap_folios);
> list_add_tail(&dst->lru, &dst_folios);
> +
> + if (migrc)
> + migrc_mark_folios(folio, dst);
> break;
> default:
> /*
> @@ -1733,12 +1981,24 @@ static int migrate_pages_batch(struct list_head *from,
> stats->nr_failed_pages += nr_pages;
> break;
> }
> +
> + /*
> + * Done with the current folio. Fold the ro
> + * batch data gathered, to the normal batch.
> + */
> + fold_ubc_nowr();
> }
> }
> nr_failed += retry;
> stats->nr_thp_failed += thp_retry;
> stats->nr_failed_pages += nr_retry_pages;
> move:
> + /*
> + * Generate a new timestamp that will be used to filter out
> + * CPUs that have already performed TLB flushes needed.
> + */
> + migrc_unmap_done();
> +
> /* Flush TLBs for all unmapped folios */
> try_to_unmap_flush();
>
> @@ -1774,6 +2034,9 @@ static int migrate_pages_batch(struct list_head *from,
> case MIGRATEPAGE_SUCCESS:
> stats->nr_succeeded += nr_pages;
> stats->nr_thp_succeeded += is_thp;
> +
> + if (migrc_marked_folios(folio, dst))
> + migrc_expand_req(folio);
> break;
> default:
> nr_failed++;
> @@ -1791,6 +2054,8 @@ static int migrate_pages_batch(struct list_head *from,
>
> rc = rc_saved ? : nr_failed;
> out:
> + migrc_req_end_if_started();
> +
> /* Cleanup remaining folios */
> dst = list_first_entry(&dst_folios, struct folio, lru);
> dst2 = list_next_entry(dst, lru);
> @@ -1798,6 +2063,9 @@ static int migrate_pages_batch(struct list_head *from,
> int page_was_mapped = 0;
> struct anon_vma *anon_vma = NULL;
>
> + if (migrc_marked_folios(folio, dst))
> + migrc_undo_folios(folio, dst);
> +
> __migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
> migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
> true, ret_folios);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 95546f376302..2fa84c3a9681 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>
> set_page_owner(page, order, gfp_flags);
> page_table_check_alloc(page, order);
> +
> + for (i = 0; i != 1 << order; ++i)
> + migrc_init_page(page + i);

Do we really need an additional atomic operation to clear the page-flag here?

> }
>
> static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> @@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> long min = mark;
> int o;
>
> + free_pages += migrc_pending_nr_in_zone(z);
> +
> /* free_pages may go negative - that's OK */
> free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>
> @@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> long usable_free;
> long reserved;
>
> - usable_free = free_pages;
> + usable_free = free_pages + migrc_pending_nr_in_zone(z);
> reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
>
> /* reserved may over estimate high-atomic reserves. */
> @@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> gfp_mask)) {
> int ret;
>
> + /*
> + * Free the pending folios so that the remaining
> + * code can use them. Check zone_watermark_fast()
> + * again with more accurate stats before going.
> + */
> + migrc_try_flush_free_folios(ac->nodemask);
> + if (zone_watermark_fast(zone, order, mark,
> + ac->highest_zoneidx, alloc_flags, gfp_mask))
> + goto try_this_zone;
> +
> if (has_unaccepted_memory()) {
> if (try_to_accept_memory(zone, order))
> goto try_this_zone;
> @@ -3911,6 +3926,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> unsigned int cpuset_mems_cookie;
> unsigned int zonelist_iter_cookie;
> int reserve_flags;
> + bool migrc_stopped = false;
>
> restart:
> compaction_retries = 0;
> @@ -4042,6 +4058,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (page)
> goto got_pg;
>
> + /*
> + * The system is in very high memory pressure. Stop migrc from
> + * expanding its pending queue and working temporarily.
> + */
> + if (!migrc_stopped) {
> + migrc_stop_pending();
> + migrc_stopped = true;
> + }
> +
> /* Caller is not willing to reclaim, we can't balance anything */
> if (!can_direct_reclaim)
> goto nopage;
> @@ -4169,6 +4194,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> warn_alloc(gfp_mask, ac->nodemask,
> "page allocation failure: order:%u", order);
> got_pg:
> + if (migrc_stopped)
> + migrc_resume_pending();
> return page;
> }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a045f3b57c60..d0fa7bf66e70 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -606,6 +606,82 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
>
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>
> +#ifdef CONFIG_MIGRC
> +/*
> + * Gather folios and CPUs in cpumask to handle.
> + */
> +static void migrc_gather(struct llist_head *folios,
> + struct arch_tlbflush_unmap_batch *arch,
> + struct llist_head *reqs)
> +{
> + struct llist_node *reqs_nodes;
> + struct migrc_req *req;
> + struct migrc_req *req2;
> +
> + reqs_nodes = llist_del_all(reqs);
> + if (!reqs_nodes)
> + return;
> +
> + /*
> + * TODO: Optimize the time complexity.
> + */

Please mark the patch as an RFC until no more TODOs.

> + llist_for_each_entry_safe(req, req2, reqs_nodes, llnode) {
> + struct llist_node *n;
> +
> + arch_migrc_adj(&req->arch, req->gen);
> + arch_tlbbatch_fold(arch, &req->arch);
> +
> + n = llist_del_all(&req->folios);
> + llist_add_batch(n, req->last, folios);
> + free_migrc_req(req);
> + }
> +}
> +
> +bool migrc_try_flush_free_folios(nodemask_t *nodes)
> +{
> + struct arch_tlbflush_unmap_batch arch;
> + LLIST_HEAD(folios);

Again, I don’t understand why llist is needed.

> + bool ret = false;
> + int nid;
> +
> + arch_tlbbatch_clean(&arch);
> + nodes = nodes ?: &node_possible_map;

I would just use an if
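
i.e., just spell it out:

	if (!nodes)
		nodes = &node_possible_map;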

> + for_each_node_mask(nid, *nodes)
> + migrc_gather(&folios, &arch, &NODE_DATA(nid)->migrc_reqs);
> +
> + ret = !llist_empty(&folios);
> + if (ret) {
> + arch_tlbbatch_flush(&arch);
> + migrc_shrink(&folios);
> + }
> + return ret;
> +}
> +
> +/*
> + * Fold ro batch data to the current migrc's request. In case that the
> + * request is not ready, fold it to the normal batch as a fallback.
> + */
> +void fold_ubc_nowr_to_migrc(void)
> +{
> + struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
> + struct migrc_req *req;
> +
> + if (!tlb_ubc_nowr->nr_flush_required)
> + return;
> +
> + req = current->mreq;
> + if (!req) {
> + /* fallback */
> + fold_ubc_nowr();
> + return;
> + }
> +
> + arch_tlbbatch_fold(&req->arch, &tlb_ubc_nowr->arch);
> + tlb_ubc_nowr->nr_flush_required = 0;
> + tlb_ubc_nowr->writable = false;
> +}
> +#endif
> +
> void fold_ubc_nowr(void)
> {
> struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> --
> 2.17.1

2023-10-30 08:51:51

by Nadav Amit

[permalink] [raw]
Subject: Re: [v3 3/3] mm, migrc: Add a sysctl knob to enable/disable MIGRC mechanism


> On Oct 30, 2023, at 9:25 AM, Byungchul Park <[email protected]> wrote:
>
> Add a sysctl knob, '/proc/sys/vm/migrc_enable' to switch on/off migrc.

Please explain how users are expected to use this knob.

I suspect that the knob and the config option are not useful. You probably
used them for your evaluation or as a “chicken-bit”, but they are not
useful anymore.

2023-10-30 09:59:51

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

On Mon, Oct 30, 2023 at 09:00:56AM +0100, David Hildenbrand wrote:
> On 30.10.23 08:25, Byungchul Park wrote:
> > Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
> > We always face the migration overhead at either promotion or demotion,
> > while working with tiered memory e.g. CXL memory and found out TLB
> > shootdown is a quite big one that is needed to get rid of if possible.
> >
> > Fortunately, TLB flush can be defered or even skipped if both source and
> > destination of folios during migration are kept until all TLB flushes
> > required will have been done, of course, only if the target PTE entries
> > have read only permission, more precisely speaking, don't have write
> > permission. Otherwise, no doubt the folio might get messed up.
> >
> > To achieve that:
> >
> > 1. For the folios that map only to non-writable TLB entries, prevent
> > TLB flush at migration by keeping both source and destination
> > folios, which will be handled later at a better time.
> >
> > 2. When any non-writable TLB entry changes to writable e.g. through
> > fault handler, give up CONFIG_MIGRC mechanism so as to perform
> > TLB flush required right away.
> >
> > 3. Temporarily stop migrc from working when the system is in very
> > high memory pressure e.g. direct reclaim needed.
> >
> > The measurement result:
> >
> > Architecture - x86_64
> > QEMU - kvm enabled, host cpu
> > Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> > Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
> > Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
> >
> > run 'perf stat' using events:
> > 1) itlb.itlb_flush
> > 2) tlb_flush.dtlb_thread
> > 3) tlb_flush.stlb_any
> > 4) dTLB-load-misses
> > 5) dTLB-store-misses
> > 6) iTLB-load-misses
> >
> > run 'cat /proc/vmstat' and pick:
> > 1) numa_pages_migrated
> > 2) pgmigrate_success
> > 3) nr_tlb_remote_flush
> > 4) nr_tlb_remote_flush_received
> > 5) nr_tlb_local_flush_all
> > 6) nr_tlb_local_flush_one
> >
> > BEFORE - mainline v6.6-rc5
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 20953405 itlb.itlb_flush
> > 114886593 tlb_flush.dtlb_thread
> > 88267015 tlb_flush.stlb_any
> > 115304095543 dTLB-load-misses
> > 163904743 dTLB-store-misses
> > 608486259 iTLB-load-misses
> >
> > 556.787113849 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
> > ...
> > numa_pages_migrated 3378748
> > pgmigrate_success 7720310
> > nr_tlb_remote_flush 751464
> > nr_tlb_remote_flush_received 10742115
> > nr_tlb_local_flush_all 21899
> > nr_tlb_local_flush_one 740157
> > ...
> >
> > AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 4353555 itlb.itlb_flush
> > 72482780 tlb_flush.dtlb_thread
> > 68226458 tlb_flush.stlb_any
> > 114331610808 dTLB-load-misses
> > 116084771 dTLB-store-misses
> > 377180518 iTLB-load-misses
> >
> > 552.667718220 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
>
> So, an improvement of 0.74% ? How stable are the results? Serious question:

I'm getting very stable results.

> worth the churn?

Yes, ultimately the time-wise improvement should be observable. However,
I've been focusing on the number of TLB flushes and TLB misses, because
a better total-time result will follow depending on the test conditions.
We should be able to see it on a system that:

1. has more CPUs, which would induce a huge number of IPIs.
2. has slow memory, which makes the TLB miss overhead bigger.
3. runs workloads that are sensitive to TLB misses and IPI storms.
4. runs workloads that cause heavier numa migrations.
5. runs workloads that have a lot of read-only mappings.
6. and so on.

I will share the results once I manage to meet the conditions.

By the way, I should've included the IPI reduction as well, since it also
shows a super big delta :)

> Or did I get the numbers wrong?
>
> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5c02720c53a5..1ca2ac91aa14 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,9 @@ enum pageflags {
> > #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > PG_arch_2,
> > PG_arch_3,
> > +#endif
> > +#ifdef CONFIG_MIGRC
> > + PG_migrc, /* Page has its copy under migrc's control */
> > #endif
> > __NR_PAGEFLAGS,
> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> > PAGEFLAG(Idle, idle, PF_ANY)
> > #endif
> > +#ifdef CONFIG_MIGRC
> > +PAGEFLAG(Migrc, migrc, PF_ANY)
> > +#endif
>
> I assume you know this: new pageflags are frowned upon.

Sorry for that. I really didn't want to add a new headache.

Byungchul

2023-10-30 10:26:48

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 1/3] mm/rmap: Recognize non-writable TLB entries during TLB batch flush

On Mon, Oct 30, 2023 at 07:52:05AM +0000, Nadav Amit wrote:
>
> Below are some points you might find useful:

Thank you!

> > +
> > /*
> > * Blindly accessing user memory from NMI context can be dangerous
> > * if we're in the middle of switching the current user task or
> > diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> > index aa44fff8bb9d..35ba9425d48d 100644
> > --- a/include/linux/mm_types_task.h
> > +++ b/include/linux/mm_types_task.h
> > @@ -59,8 +59,8 @@ struct tlbflush_unmap_batch {
> > */
> > struct arch_tlbflush_unmap_batch arch;
> >
> > - /* True if a flush is needed. */
> > - bool flush_required;
> > + /* The number of flush requested. */
>
> Number of what? Base pages I presume.

How many times set_tlb_ubc_flush_pending() has been called.
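
I will make the comment say that explicitly, something like (wording
not final):

	/*
	 * The number of times set_tlb_ubc_flush_pending() has been
	 * called for this batch since the last flush.
	 */
	int nr_flush_required;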

> > + int nr_flush_required;
>
> Perhaps unsigned would be better suited?

Will change it to unsigned.

> > /*
> > * If true then the PTE was dirty when unmapped. The entry must be
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 77f01ac385f7..63189c023357 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1324,6 +1324,7 @@ struct task_struct {
> > #endif
> >
> > struct tlbflush_unmap_batch tlb_ubc;
> > + struct tlbflush_unmap_batch tlb_ubc_nowr;
>
> tlb_ubc_nowr is - I think - less informative the tlb_ubc_ro (and a comment
> would be useful).

At the beginning I named it tlb_ubc_ro, but I forgot why I changed it
to tlb_ubc_nowr. I will change it back and add a comment on it.

> > +
> > +int nr_flush_required(void)
> > +{
> > + return current->tlb_ubc.nr_flush_required;
> > +}
> > +
> > +int nr_flush_required_nowr(void)
> > +{
> > + return current->tlb_ubc_nowr.nr_flush_required;
> > +}
>
> I haven’t gone through the users of these functions yet, as they are not included
> in this patch (which is usually not great).

Right. I will move these two into the patch that uses the functions,
or add an explanation in this commit message.

> Anyhow, it might be a bit wasteful to have a function call for such a function. See
> if it is possible to avoid that call.

I will move them to mm/internal.h and make them inline if possible.
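
Roughly, assuming mm/internal.h ends up visible to every caller that
needs these:

	/* mm/internal.h */
	static inline int nr_flush_required(void)
	{
		return current->tlb_ubc.nr_flush_required;
	}

	static inline int nr_flush_required_nowr(void)
	{
		return current->tlb_ubc_nowr.nr_flush_required;
	}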

> > +
> > /*
> > * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> > * important if a PTE was dirty when it was unmapped that it's flushed
> > @@ -615,11 +641,12 @@ void try_to_unmap_flush(void)
> > {
> > struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> >
> > - if (!tlb_ubc->flush_required)
> > + fold_ubc_nowr();
> > + if (!tlb_ubc->nr_flush_required)
> > return;
> >
> > arch_tlbbatch_flush(&tlb_ubc->arch);
> > - tlb_ubc->flush_required = false;
> > + tlb_ubc->nr_flush_required = 0;
> > tlb_ubc->writable = false;
> > }
> >
> > @@ -627,8 +654,9 @@ void try_to_unmap_flush(void)
> > void try_to_unmap_flush_dirty(void)
> > {
> > struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> > + struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
> >
> > - if (tlb_ubc->writable)
> > + if (tlb_ubc->writable || tlb_ubc_nowr->writable)
> > try_to_unmap_flush();
> > }
> >
> > @@ -645,15 +673,16 @@ void try_to_unmap_flush_dirty(void)
> > static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> > unsigned long uaddr)
> > {
> > - struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> > + struct tlbflush_unmap_batch *tlb_ubc;
> > int batch;
> > bool writable = pte_dirty(pteval);
> >
> > if (!pte_accessible(mm, pteval))
> > return;
> >
> > + tlb_ubc = pte_write(pteval) || writable ? &current->tlb_ubc : &current->tlb_ubc_nowr;
>
> Using the ternary operator here is a bit confusing. You can use an “if”
> instead or if you mind is set doing it this way at least make it easier to
> read:
>
> tlb_ubc = (pte_write(pteval) || writable) ? &current->tlb_ubc :
> &current->tlb_ubc_nowr;

You are right. I should change it that way. Thanks.

> And of course, add a comment.

Okay. I will also add a comment.
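
Something like this, then (comment wording is just a draft):

	/*
	 * Batch non-writable, clean PTEs separately from the rest so
	 * that their TLB flush can be deferred further, e.g. by migrc.
	 */
	if (pte_write(pteval) || writable)
		tlb_ubc = &current->tlb_ubc;
	else
		tlb_ubc = &current->tlb_ubc_nowr;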

> > arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
> > - tlb_ubc->flush_required = true;
> > + tlb_ubc->nr_flush_required += 1;
>
> Presumably overflow is impossible for other reasons, but something like that
> worries me.

Agreed. Lemme think about it more and fix it.

Thank you.

Byungchul

2023-10-30 10:37:36

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 3/3] mm, migrc: Add a sysctl knob to enable/disable MIGRC mechanism

On Mon, Oct 30, 2023 at 08:51:35AM +0000, Nadav Amit wrote:
>
> > On Oct 30, 2023, at 9:25 AM, Byungchul Park <[email protected]> wrote:
> >
> > Add a sysctl knob, '/proc/sys/vm/migrc_enable' to switch on/off migrc.
>
> Please explain how users are expected to use this knob.
>
> I suspect that the knob and the config option are not useful. You probably
> used them for your evaluation or as a “chicken-bit”, but they are not
> useful anymore.

Okay. We can add the sysctl knob when we need it. Will remove it for now.

However, I'm not sure if it'd be okay without CONFIG_MIGRC. I'd like to
see more opinions on it before going that way.

Byungchul

2023-10-30 12:52:05

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

On Mon, Oct 30, 2023 at 08:50:20AM +0000, Nadav Amit wrote:
> Hi Byungchul,
>
> Few more comments. I think I lost concentration towards the end.
>
> This patch should be broken into multiple patches, I think, and have more
> comments.
>
> I didn’t thoroughly review the entire “stopping” mechanism and other parts
> of the code.

I spent a lot of time hesitating over whether to split it or not.
However, the test results cannot be stable without all of these parts
together. I'm still unsure: the test results should be stable at each
commit, right?

> > On Oct 30, 2023, at 9:25 AM, Byungchul Park <[email protected]> wrote:
> >
> >
> >
> > +#ifdef CONFIG_MIGRC
>
> Do you really need a new config option?

I will remove the config option if more people share the same opinion.
Thank you for the feedback anyway.

> I think you rely on the UBC stuff and that should be the only config
> option you rely on.

CONFIG_64BIT too, for a new page flag.

> > +#else
> > +static inline void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) {}
> > +#endif
> > +
> > static inline bool pte_flags_need_flush(unsigned long oldflags,
> > unsigned long newflags,
> > bool ignore_access)
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 314cd9912a88..0bdb617ffe94 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1219,9 +1219,80 @@ STATIC_NOPV void native_flush_tlb_local(void)
> > native_write_cr3(__native_read_cr3());
> > }
> >
> > +#ifdef CONFIG_MIGRC
> > +DEFINE_PER_CPU(int, migrc_done);
>
> Any concern about overflow?

No problem, because I use the before() function below to compare the
two int values.

> Following what I read later in the patch, you consider migrc_done as shared.
> If that’s the case you should have used DECLARE_PER_CPU_SHARED_ALIGNED.

Each CPU updates its own migrc_done, while any CPU may read all the
CPUs' migrc_done at any time. I will change it to the _SHARED_ALIGNED
variant.
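
That is, in tlb.c (DEFINE_PER_CPU_SHARED_ALIGNED rather than DECLARE_,
since this is the definition):

	DEFINE_PER_CPU_SHARED_ALIGNED(int, migrc_done);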

> But I am not sure it should be shared, because that’s is used for an
> optimization. See my comment below regarding this matter.
>
> > +
> > +static inline bool before(int a, int b)
> > +{
> > + return a - b < 0;
> > +}
>
> I think you should be able to manage without this function.

How? AFAIK, generation numbers are usually handled this way. Am I
missing something?
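
For the record, the wrap-around case works out like this, relying on
the wrapping signed arithmetic the kernel build flags guarantee:

	/*
	 * e.g. a = INT_MAX (the old gen), b = INT_MIN (the next gen
	 * after the wrap): a - b wraps to -1, so before(a, b) is true
	 * as expected, i.e. the old gen is still considered "before"
	 * the new one.
	 */
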

> > +
> > +static inline int migrc_tlb_local_begin(void)
>
> Usually, it is better to leave it to the compiler to make inline decisions.
> IOW, if there is no really good reason, I wouldn't use “inline" in “.c” files.

Hm.. Okay I will.

> > +{
> > + int ret = atomic_read(&migrc_gen);
> > +
> > + /*
> > + * The following order should be guaranteed:
> > + *
> > + * 1. A migrc's request is queued at migration.
> > + * 2. Each CPU performs __flush_tlb_local().
> > + * 3. Marks this CPU has performed the TLB flush after 1.
>
> (3) is not clear to me, specifically what “after 1” means.

It means: has the TLB flush on this CPU, required by a migrc request,
been performed *since* the request was queued?

> > + *
> > + * To achieve that, a kind of timestamp, migrc_gen was
> > + * introduced to track the the order but conservatively, that
> > + * is, it doesn't care whether TLB flush is performed more than
> > + * it's needed but should guarantee TLB flush has been performed
> > + * after the timestamp in migrc_done.
> > + *
> > + * 1 is done by migrc_unmap_done() and migrc_req_end().
> > + * 2 is done by __flush_tlb_local().
> > + * 3 is done by migrc_tlb_local_begin() and migrc_tlb_local_end().
> > + *
> > + * FYI, an example that should avoid is:
>
> FYI?
>
> > + *
> > + * 1. Performs __flush_tlb_local().
> > + * 2. A migrc's request is queued at migration.
> > + * 3. Marks this CPU has performed the TLB flush after 2(?).
>
> I think this example is not needed. But the comment can be improved.

I also think the example is not strictly necessary, but I added it for
better understanding. So you are asking me to improve the comment,
right? Lemme try it.
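
Maybe something along these lines (draft wording only):

	/*
	 * The order that must hold is:
	 *
	 *  1. migrc_gen is incremented once a migrc request has its
	 *     unmaps done.
	 *  2. This CPU performs __flush_tlb_local().
	 *  3. This CPU publishes, in migrc_done, the migrc_gen value
	 *     it read *before* the flush.
	 *
	 * Then arch_migrc_adj() can skip any CPU whose migrc_done is
	 * not before a request's gen, because that CPU's flush
	 * happened after the request's unmaps were done.
	 */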

> > + *
> > + * XXX: barrier() would be sufficient if the architecture
> > + * already quarantees the order between memory access to
> > + * migrc_gen and TLB flush.
> > + */
>
> Since UBC is used by two architectures IIRC, then is it needed or not?
> Having an unnecessary smp_mb() is not great.

Totally right.

> On x86 the memory write and the TLB flush are ordered. I think you also

What about a memory read and a TLB flush? If those are ordered too,
then I can use barrier() instead of smp_mb(). Thanks a lot!

> regard the remote TLB flush and those should also be ordered (IIUC).
>
> > + smp_mb();
> > +
> > + return ret;
> > +}
> > +
> > +static inline void migrc_tlb_local_end(int gen)
> > +{
> > + /*
> > + * XXX: barrier() would be sufficient if the architecture
> > + * quarantees the order between TLB flush and memory access to
> > + * migrc_done.
> > + */
> > + smp_mb();
> > +
> > + if (before(READ_ONCE(*this_cpu_ptr(&migrc_done)), gen))
>
> I have no idea why READ_ONCE() is needed here, and the fact that you think
> it is needed is a bit concerning.

I was worried about whether we can guarantee that the read *actually*
accesses memory every time we get here repeatedly. However, yeah, I
don't have to worry; I think the compiler will do the right thing
here. I will remove it. Thank you.

> > + WRITE_ONCE(*this_cpu_ptr(&migrc_done), gen);
> > +}
> > +#else
> > +static inline int migrc_tlb_local_begin(void)
> > +{
> > + return 0;
> > +}
> > +
> > +static inline void migrc_tlb_local_end(int gen)
> > +{
> > +}
> > +#endif
> > +
> > void flush_tlb_local(void)
> > {
> > + unsigned int gen;
> > +
> > + gen = migrc_tlb_local_begin();
> > __flush_tlb_local();
> > + migrc_tlb_local_end(gen);
> > }
> >
> > /*
> > @@ -1246,6 +1317,21 @@ void __flush_tlb_all(void)
> > }
> > EXPORT_SYMBOL_GPL(__flush_tlb_all);
> >
> > +#ifdef CONFIG_MIGRC
> > +/*
> > + * Skip CPUs that have already performed TLB flushes required, which is
> > + * tracked by migrc_gen and migrc_done.
> > + */
> > +void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen)
> > +{
> > + int cpu;
> > +
> > + for_each_cpu(cpu, &batch->cpumask)
> > + if (!before(READ_ONCE(*per_cpu_ptr(&migrc_done, cpu)), gen))
> > + cpumask_clear_cpu(cpu, &batch->cpumask);
>
> This is an optimization, and I think it should have gone into a different patch
> with separate performance benefit evaluation. Accessing remote CPUs caches to
> decide whether to flush the remote TLB is a performance tradeoff, so this change
> might not always be a win.

Right. I will.

> > +}
> > +#endif
> > +
> > void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > {
> > struct flush_tlb_info *info;
> > @@ -1274,6 +1360,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > put_cpu();
> > }
> >
> > +void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch)
>
> arch_tlbbatch_clear() perhaps? Or maybe even better name that describes what
> it does in a higher level (i.e., when it should be used?)

I use this in each preparation phase, when I want to start gathering
batched information, e.g. which CPUs need a TLB flush. Candidate names:

arch_tlbbatch_init()
arch_tlbbatch_clear()
arch_tlbbatch_reset()
arch_tlbbatch_?()

> > +{
> > + cpumask_clear(&batch->cpumask);
> > +}
> > +
> > void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
> > struct arch_tlbflush_unmap_batch *bsrc)
> > {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index bf5d0b1b16f4..ddefe6033b66 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4062,4 +4062,29 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> >
> > #endif
> >
> > +#ifdef CONFIG_MIGRC
> > +void migrc_init_page(struct page *p);
> > +void migrc_unpend_folio(struct folio *f);
> > +bool migrc_pending_folio(struct folio *f);
> > +void migrc_shrink(struct llist_head *h);
> > +bool migrc_try_flush_free_folios(nodemask_t *nodes);
> > +void fold_ubc_nowr_to_migrc(void);
> > +void free_migrc_req(struct migrc_req *req);
> > +int migrc_pending_nr_in_zone(struct zone *z);
> > +void migrc_stop_pending(void);
> > +void migrc_resume_pending(void);
>
> How about some comments to describe when each of those should be used?

Sounds great. I will add comments.

> And do they need to be in mm.h or can they be in mm/internal.h ?

Otherwise, architecture code like x86 cannot use these functions.

> > +extern atomic_t migrc_gen;
> > +#else
> > +static inline void migrc_init_page(struct page *p) {}
> > +static inline void migrc_unpend_folio(struct folio *f) {}
> > +static inline bool migrc_pending_folio(struct folio *f) { return false; }
> > +static inline void migrc_shrink(struct llist_head *h) {}
> > +static inline bool migrc_try_flush_free_folios(nodemask_t *nodes) { return false; }
> > +static inline void fold_ubc_nowr_to_migrc(void) {}
> > +static inline void free_migrc_req(struct migrc_req *req) {}
> > +static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
> > +static inline void migrc_stop_pending(void) {}
> > +static inline void migrc_resume_pending(void) {}
> > +#endif
> > #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 36c5b43999e6..55350f1d0db2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -101,6 +101,13 @@ struct page {
> > /* Or, free page */
> > struct list_head buddy_list;
> > struct list_head pcp_list;
> > +#ifdef CONFIG_MIGRC
> > + /*
> > + * Hang onto migrc's request queues,
> > + * waiting for TLB flushes.
> > + */
> > + struct llist_node migrc_node;
> > +#endif
>
> Any reason for the use of llist?

It was for a lockless design at the beginning, but for now you are
right, I don't have to keep it as an llist. I will change it to use
the lru field for that purpose instead.

I'm surprised how well you understand the code. Respect.
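
Concretely, something like this, assuming the folio is already
isolated from the LRU lists at that point so ->lru is free to reuse:

	/* in struct migrc_req, instead of the llist_head */
	struct list_head folios;	/* linked via folio->lru */

	/* queueing a folio into the request */
	list_add_tail(&f->lru, &req->folios);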

> > };
> > /* See page-flags.h for PAGE_MAPPING_FLAGS */
> > struct address_space *mapping;
> > @@ -306,6 +313,13 @@ struct folio {
> > /* private: */
> > };
> > /* public: */
> > +#ifdef CONFIG_MIGRC
> > + /*
> > + * Hang onto migrc's request queues,
> > + * waiting for TLB flushes.
> > + */
> > + struct llist_node migrc_node;
> > +#endif
> > };
> > struct address_space *mapping;
> > pgoff_t index;
> > @@ -1372,4 +1386,39 @@ enum {
> > /* See also internal only FOLL flags in mm/internal.h */
> > };
> >
> > +#ifdef CONFIG_MIGRC
> > +struct migrc_req {
> > + /*
> > + * folios pending for TLB flush
> > + */
> > + struct llist_head folios;
> > +
> > + /*
> > + * the last folio in 'struct llist_head folios'
> > + */
> > + struct llist_node *last;
>
> Sounds like some unconventional synchronization scheme is going to come
> later. Sounds like trouble...

I don't think so. What trouble would this cause? Anyway, I won't need
this anymore if I use list_head instead of llist.

> > +
> > + /*
> > + * for hanging to migrc's request queue, migrc_reqs
> > + */
> > + struct llist_node llnode;
> > +
> > + /*
> > + * architecture specific batch data
> > + */
> > + struct arch_tlbflush_unmap_batch arch;
> > +
> > + /*
> > + * timestamp when hanging to migrc's request queue, migrc_reqs
> > + */
> > + int gen;
> > +
> > + /*
> > + * associated numa node
> > + */
> > + int nid;
> > +};
> > +#else
> > +struct migrc_req {};
> > +#endif
> > #endif /* _LINUX_MM_TYPES_H */
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4106fbc5b4b3..4e78d1c9ceb2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -980,6 +980,9 @@ struct zone {
> > /* Zone statistics */
> > atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> > atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
> > +#ifdef CONFIG_MIGRC
> > + atomic_t migrc_pending_nr;
> > +#endif
> > } ____cacheline_internodealigned_in_smp;
> >
> > enum pgdat_flags {
> > @@ -1398,6 +1401,10 @@ typedef struct pglist_data {
> > #ifdef CONFIG_MEMORY_FAILURE
> > struct memory_failure_stats mf_stats;
> > #endif
> > +#ifdef CONFIG_MIGRC
> > + atomic_t migrc_pending_nr;
> > + struct llist_head migrc_reqs;
> > +#endif
> > } pg_data_t;
> >
> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5c02720c53a5..1ca2ac91aa14 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,9 @@ enum pageflags {
> > #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > PG_arch_2,
> > PG_arch_3,
> > +#endif
> > +#ifdef CONFIG_MIGRC
> > + PG_migrc, /* Page has its copy under migrc's control */
>
> Well, you know that’s not great.

Yeah. I really didn't want to add this. Any idea how to distinguish
folios that have their copies in migrc's pending list from the others?

Migrc needs to initiate the required TLB flush when the do_wp_page()
fault handler changes any mapping of a folio under migrc's control
from write-protected to writable.

> > #endif
> > __NR_PAGEFLAGS,
> >
> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> > PAGEFLAG(Idle, idle, PF_ANY)
> > #endif
> >
> > +#ifdef CONFIG_MIGRC
> > +PAGEFLAG(Migrc, migrc, PF_ANY)
> > +#endif
> > +
> > /*
> > * PageReported() is used to track reported free pages within the Buddy
> > * allocator. We can use the non-atomic version of the test and set
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 63189c023357..b7c402850b2f 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1325,6 +1325,14 @@ struct task_struct {
> >
> > struct tlbflush_unmap_batch tlb_ubc;
> > struct tlbflush_unmap_batch tlb_ubc_nowr;
> > +#ifdef CONFIG_MIGRC
> > + /*
> > + * Temporarily used while a migration is on processing before
> > + * hanging to the associated numa node at the end of the
> > + * migration.
> > + */
> > + struct migrc_req *mreq;
>
> Why is this indirection needed?

To gather folios in the local context without any synchronization,
just like tlb_ubc. Otherwise I'd need to pass the pointer down through
a ton of functions as an additional parameter, which looks worse. I
guess that's why tlb_ubc is handled that way too.

> > +#endif
> >
> > /* Cache last used pipe for splice(): */
> > struct pipe_inode_info *splice_pipe;
> > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > index 1478b9dd05fa..85b3b8859bc8 100644
> > --- a/include/trace/events/mmflags.h
> > +++ b/include/trace/events/mmflags.h
> > @@ -95,6 +95,12 @@
> > #define IF_HAVE_PG_ARCH_X(_name)
> > #endif
> >
> > +#ifdef CONFIG_MIGRC
> > +#define IF_HAVE_PG_MIGRC(_name) ,{1UL << PG_##_name, __stringify(_name)}
> > +#else
> > +#define IF_HAVE_PG_MIGRC(_name)
> > +#endif
> > +
> > #define DEF_PAGEFLAG_NAME(_name) { 1UL << PG_##_name, __stringify(_name) }
> >
> > #define __def_pageflag_names \
> > @@ -125,7 +131,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
> > IF_HAVE_PG_IDLE(idle) \
> > IF_HAVE_PG_IDLE(young) \
> > IF_HAVE_PG_ARCH_X(arch_2) \
> > -IF_HAVE_PG_ARCH_X(arch_3)
> > +IF_HAVE_PG_ARCH_X(arch_3) \
> > +IF_HAVE_PG_MIGRC(migrc)
> >
> > #define show_page_flags(flags) \
> > (flags) ? __print_flags(flags, "|", \
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 6d35728b94b2..4d88dab52059 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -908,6 +908,20 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> > If set, automatic NUMA balancing will be enabled if running on a NUMA
> > machine.
> >
> > +config MIGRC
> > + bool "Deferring TLB flush by keeping read copies on migration"
> > + depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > + depends on NUMA_BALANCING
> > + depends on 64BIT
> > + default n
> > + help
> > + TLB flush is necessary when PTE changes by migration. However,
> > + TLB flush can be deferred if both copies of the src page and
> > + the dst page are kept until TLB flush if they are non-writable.
> > + System performance will be improved especially in case that
> > + promotion and demotion types of migrations are heavily
> > + happening.
>
> I don’t think users should have any control over this feature.
>
> > +
> > menuconfig CGROUPS
> > bool "Control Group support"
> > select KERNFS
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6c264d2f969c..75dc48b6e15f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3359,6 +3359,19 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > if (vmf->page)
> > folio = page_folio(vmf->page);
> >
> > + /*
> > + * This folio has its read copy to prevent inconsistency while
> > + * deferring TLB flushes. However, the problem might arise if
> > + * it's going to become writable.
> > + *
> > + * To prevent it, give up the deferring TLB flushes and perform
> > + * TLB flush right away.
> > + */
> > + if (folio && migrc_pending_folio(folio)) {
> > + migrc_unpend_folio(folio);
> > + migrc_try_flush_free_folios(NULL);
>
> So many potential function calls… Probably they should have been combined
> into one and at least migrc_pending_folio() should have been an inline
> function in the header.

I will try to change it as you suggest.
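
Roughly what I have in mind (migrc_flush_unpend_folio() is only a
placeholder name for the combined call):

	/* header: keep the common-case test inline and cheap */
	static inline bool migrc_pending_folio(struct folio *f)
	{
		return folio_test_migrc(f);
	}

	/* do_wp_page(): a single out-of-line call for the rare case */
	if (folio && migrc_pending_folio(folio))
		migrc_flush_unpend_folio(folio);	/* placeholder */

with migrc_flush_unpend_folio() doing the folio_clear_migrc() and
migrc_try_flush_free_folios(NULL) parts internally.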

> > + }
> > +
>
> What about mprotect? I thought David has changed it so it can set writable
> PTEs.

I will check it out.

> > /*
> > * Shared mapping: we are guaranteed to have VM_WRITE and
> > * FAULT_FLAG_WRITE set at this point.
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 2053b54556ca..76278eca8417 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -57,6 +57,204 @@
> >
> > #include "internal.h"
> >
> > +#ifdef CONFIG_MIGRC
> > +
> > +atomic_t migrc_gen;
>
> This would overflow.

It's safe. As I mentioned, I use the before() function to compare two
generation numbers.

> > +
> > +static struct migrc_req *alloc_migrc_req(void)
> > +{
> > + return kmalloc(sizeof(struct migrc_req), GFP_KERNEL);
> > +}
> > +
> > +void free_migrc_req(struct migrc_req *req)
> > +{
> > + kfree(req);
> > +}
> > +
> > +void migrc_init_page(struct page *p)
> > +{
> > + ClearPageMigrc(p);
> > +}
>
> So many wrappers for short functions. A bit of unnecessary overheads.

Hmm.. Okay. Lemme try to get rid of unnecessary wrappers.

> > +
> > +/*
> > + * One of migrc's core functions that frees up all the folios in a
> > + * migrc's pending queue.
> > + *
> > + * The list should be isolated before.
> > + */
> > +void migrc_shrink(struct llist_head *h)
>
> I am not sure by the name of it that it does what I originally thought it does
> seeing the word “shrink” (I thought it should be a memory shrinker to free memory
> when too many pages got their TLB flush deferred).
>
> I don’t think it should be exposed for someone to call it without actual TLB
> flush.

I might need a better name. I will rename it.

> > +{
> > + struct folio *f, *f2;
> > + struct llist_node *n;
> > +
> > + n = llist_del_all(h);
> > + llist_for_each_entry_safe(f, f2, n, migrc_node) {
> > + struct pglist_data *node;
> > + struct zone *zone;
> > +
> > + node = NODE_DATA(folio_nid(f));
> > + zone = page_zone(&f->page);
> > + atomic_dec(&node->migrc_pending_nr);
> > + atomic_dec(&zone->migrc_pending_nr);
>
> Why do you need both?

node->migrc_pending_nr is not used anymore in the current version. I
will remove it.

> > + folio_put(f);
> > + }
> > +}
> > +
> > +void migrc_unpend_folio(struct folio *f)
> > +{
> > + folio_clear_migrc(f);
> > +}
> > +
> > +bool migrc_pending_folio(struct folio *f)
> > +{
> > + return folio_test_migrc(f);
> > +}
> > +
> > +/*
> > + * Marks the folios as being under migrc's control.
> > + */
> > +static void migrc_mark_folios(struct folio *fsrc, struct folio *fdst)
> > +{
> > + if (!current->mreq)
> > + return;
> > +
> > + folio_get(fsrc);
> > + folio_set_migrc(fsrc);
> > + folio_set_migrc(fdst);
> > + fold_ubc_nowr_to_migrc();
> > +}
> > +
> > +static bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst)
> > +{
> > + /*
> > + * XXX: The condition below is so tricky. Please suggest a
> > + * better idea to determine if it's under migrc's control.
> > + */
> > + return migrc_pending_folio(fsrc) && migrc_pending_folio(fdst);
>
> Tricky == racy?

No. I meant the way it's implemented is tricky.

> I do not follow the ordering that ensures a migrated page would not
> become writable before the migration is over. Perhaps a comment on what
> guarantees the order would help.

I will check it and add comments explaining it.

> > +}
> > +
> > +static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
> > +{
> > + /*
> > + * TLB flushes needed are already done at this moment so the
> > + * flag doesn't have to be kept.
> > + */
> > + folio_clear_migrc(fsrc);
> > + folio_clear_migrc(fdst);
> > +
> > + folio_put(fsrc);
> > +}
> > +
> > +static void migrc_expand_req(struct folio *f)
> > +{
> > + struct migrc_req *req = current->mreq;
> > + struct pglist_data *node;
> > + struct zone *zone;
> > +
> > + if (!req)
> > + return;
> > +
> > + /*
> > + * Focusing on numa migration, all the nids in a req are same.
> > + */
> > + if (req->nid == -1)
> > + req->nid = folio_nid(f);
> > + else if (WARN_ON(req->nid != folio_nid(f)))
> > + return;
> > +
> > + if (llist_add(&f->migrc_node, &req->folios))
> > + req->last = &f->migrc_node;
> > +
> > + node = NODE_DATA(folio_nid(f));
> > + zone = page_zone(&f->page);
> > + atomic_inc(&node->migrc_pending_nr);
> > + atomic_inc(&zone->migrc_pending_nr);
> > +}
> > +
> > +static void migrc_unmap_done(void)
> > +{
> > + if (current->mreq)
> > + current->mreq->gen = atomic_inc_return(&migrc_gen);
> > +}
> > +
> > +/*
> > + * Prepares for gathering folios pending for TLB flushes, try to
> > + * allocate objects needed, initialize them and make them ready.
> > + */
> > +static void migrc_req_start(void)
> > +{
> > + struct migrc_req *req;
> > +
> > + if (WARN_ON(current->mreq))
> > + return;
> > +
> > + req = alloc_migrc_req();
> > + if (!req)
> > + return;
> > +
> > + arch_tlbbatch_clean(&req->arch);
> > + init_llist_head(&req->folios);
> > + req->last = NULL;
> > + req->nid = -1;
> > + current->mreq = req;
> > +}
> > +
> > +/*
> > + * Hang the request with the collected folios to 'migrc_reqs' llist of
> > + * the corresponding node.
> > + */
> > +static void migrc_req_end_if_started(void)
> > +{
> > + struct migrc_req *req = current->mreq;
> > +
> > + if (!req)
> > + return;
> > +
> > + if (llist_empty(&req->folios))
> > + free_migrc_req(req);
> > + else
> > + llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
> > +
> > + current->mreq = NULL;
> > +}
> > +
> > +int migrc_pending_nr_in_zone(struct zone *z)
> > +{
> > + return atomic_read(&z->migrc_pending_nr);
> > +}
> > +
> > +static atomic_t migrc_stop = ATOMIC_INIT(0);
>
> Comments?

I will add comments on it.

> > +
> > +void migrc_stop_pending(void)
> > +{
> > + atomic_inc(&migrc_stop);
> > +}
> > +
> > +void migrc_resume_pending(void)
> > +{
> > + atomic_dec(&migrc_stop);
> > +}
> > +
> > +static inline bool migrc_stopped_pending(void)
> > +{
> > + return !!atomic_read(&migrc_stop);
> > +}
> > +
> > +static bool can_migrc_system(void)
> > +{
> > + return !migrc_stopped_pending();
> > +}
> > +#else
> > +static inline void migrc_mark_folios(struct folio *fsrc, struct folio *fdst) {}
> > +static inline bool migrc_marked_folios(struct folio *fsrc, struct folio *fdst) { return false; }
> > +static inline void migrc_undo_folios(struct folio *fsrc, struct folio *fdst) {}
> > +static inline void migrc_expand_req(struct folio *f) {}
> > +static inline void migrc_unmap_done(void) {}
> > +static inline void migrc_req_start(void) {}
> > +static inline void migrc_req_end_if_started(void) {}
> > +static inline bool can_migrc_system(void) { return false; }
> > +#endif
> > +
> > bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> > {
> > struct folio *folio = folio_get_nontail_page(page);
> > @@ -379,6 +577,7 @@ static int folio_expected_refs(struct address_space *mapping,
> > struct folio *folio)
> > {
> > int refs = 1;
> > +
> > if (!mapping)
> > return refs;
> >
> > @@ -406,6 +605,13 @@ int folio_migrate_mapping(struct address_space *mapping,
> > int expected_count = folio_expected_refs(mapping, folio) + extra_count;
> > long nr = folio_nr_pages(folio);
> >
> > + /*
> > + * Migrc mechanism increases the reference count to defer
> > + * freeing up folios until a better time.
> > + */
> > + if (migrc_marked_folios(folio, newfolio))
> > + expected_count++;
> > +
> > if (!mapping) {
> > /* Anonymous page without mapping */
> > if (folio_ref_count(folio) != expected_count)
> > @@ -1620,16 +1826,30 @@ static int migrate_pages_batch(struct list_head *from,
> > LIST_HEAD(unmap_folios);
> > LIST_HEAD(dst_folios);
> > bool nosplit = (reason == MR_NUMA_MISPLACED);
> > + bool migrc_cond1;
>
> Try to find better names.

Okay.

> > VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
> > !list_empty(from) && !list_is_singular(from));
> >
> > + /*
> > + * For now, apply migrc only to numa migration.
> > + */
> > + migrc_cond1 = can_migrc_system() &&
> > + ((reason == MR_DEMOTION && current_is_kswapd()) ||
> > + (reason == MR_NUMA_MISPLACED));
> > +
> > + if (migrc_cond1)
> > + migrc_req_start();
> > for (pass = 0; pass < nr_pass && retry; pass++) {
> > retry = 0;
> > thp_retry = 0;
> > nr_retry_pages = 0;
> >
> > list_for_each_entry_safe(folio, folio2, from, lru) {
> > + int nr_required;
> > + bool migrc_cond2;
> > + bool migrc;
> > +
> > is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
> > nr_pages = folio_nr_pages(folio);
> >
> > @@ -1657,9 +1877,34 @@ static int migrate_pages_batch(struct list_head *from,
> > continue;
> > }
> >
> > + /*
> > + * In case that the system is in high memory
> > + * pressure, let's stop migrc from working and
> > + * free up all the folios in the pending queues
> > + * not to impact the system.
> > + */
> > + if (unlikely(migrc_cond1 && !can_migrc_system())) {
> > + migrc_cond1 = false;
> > + migrc_req_end_if_started();
> > + migrc_try_flush_free_folios(NULL);
> > + }
> > +
> > + nr_required = nr_flush_required();
> > rc = migrate_folio_unmap(get_new_folio, put_new_folio,
> > private, folio, &dst, mode, reason,
> > ret_folios);
> > + /*
> > + * migrc_cond2 will be true only if:
> > + *
> > + * 1. There's no writable mapping at all
> > + * 2. There's read only mapping found
> > + * 3. Not already under migrc's control
> > + */
> > + migrc_cond2 = nr_required == nr_flush_required() &&
> > + nr_flush_required_nowr() &&
> > + !migrc_pending_folio(folio);
> > + migrc = migrc_cond1 && migrc_cond2;
> > +
> > /*
> > * The rules are:
> > * Success: folio will be freed
> > @@ -1720,6 +1965,9 @@ static int migrate_pages_batch(struct list_head *from,
> > case MIGRATEPAGE_UNMAP:
> > list_move_tail(&folio->lru, &unmap_folios);
> > list_add_tail(&dst->lru, &dst_folios);
> > +
> > + if (migrc)
> > + migrc_mark_folios(folio, dst);
> > break;
> > default:
> > /*
> > @@ -1733,12 +1981,24 @@ static int migrate_pages_batch(struct list_head *from,
> > stats->nr_failed_pages += nr_pages;
> > break;
> > }
> > +
> > + /*
> > + * Done with the current folio. Fold the ro
> > + * batch data gathered, to the normal batch.
> > + */
> > + fold_ubc_nowr();
> > }
> > }
> > nr_failed += retry;
> > stats->nr_thp_failed += thp_retry;
> > stats->nr_failed_pages += nr_retry_pages;
> > move:
> > + /*
> > + * Generate a new timestamp that will be used to filter out
> > + * CPUs that have already performed TLB flushes needed.
> > + */
> > + migrc_unmap_done();
> > +
> > /* Flush TLBs for all unmapped folios */
> > try_to_unmap_flush();
> >
> > @@ -1774,6 +2034,9 @@ static int migrate_pages_batch(struct list_head *from,
> > case MIGRATEPAGE_SUCCESS:
> > stats->nr_succeeded += nr_pages;
> > stats->nr_thp_succeeded += is_thp;
> > +
> > + if (migrc_marked_folios(folio, dst))
> > + migrc_expand_req(folio);
> > break;
> > default:
> > nr_failed++;
> > @@ -1791,6 +2054,8 @@ static int migrate_pages_batch(struct list_head *from,
> >
> > rc = rc_saved ? : nr_failed;
> > out:
> > + migrc_req_end_if_started();
> > +
> > /* Cleanup remaining folios */
> > dst = list_first_entry(&dst_folios, struct folio, lru);
> > dst2 = list_next_entry(dst, lru);
> > @@ -1798,6 +2063,9 @@ static int migrate_pages_batch(struct list_head *from,
> > int page_was_mapped = 0;
> > struct anon_vma *anon_vma = NULL;
> >
> > + if (migrc_marked_folios(folio, dst))
> > + migrc_undo_folios(folio, dst);
> > +
> > __migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
> > migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
> > true, ret_folios);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 95546f376302..2fa84c3a9681 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >
> > set_page_owner(page, order, gfp_flags);
> > page_table_check_alloc(page, order);
> > +
> > + for (i = 0; i != 1 << order; ++i)
> > + migrc_init_page(page + i);
>
> Do we really need an additional atomic operation to clear the page-flag here?

Of course not. I will change it.
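
For the record, a minimal sketch of what the non-atomic variant could look
like, assuming migrc_init_page() only clears PG_migrc; the __CLEARPAGEFLAG()
line and __ClearPageMigrc() are assumptions made for this sketch, not taken
from the posted patch:

        /* include/linux/page-flags.h: also generate the non-atomic helper */
        #ifdef CONFIG_MIGRC
        PAGEFLAG(Migrc, migrc, PF_ANY) __CLEARPAGEFLAG(Migrc, migrc, PF_ANY)
        #endif

        /*
         * mm/page_alloc.c, post_alloc_hook(): the freshly allocated pages
         * are not visible to anyone else yet, so a plain non-atomic bit
         * clear is enough and avoids one atomic op per page.
         */
        for (i = 0; i != 1 << order; ++i)
                __ClearPageMigrc(page + i);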

> > }
> >
> > static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> > @@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> > long min = mark;
> > int o;
> >
> > + free_pages += migrc_pending_nr_in_zone(z);
> > +
> > /* free_pages may go negative - that's OK */
> > free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
> >
> > @@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> > long usable_free;
> > long reserved;
> >
> > - usable_free = free_pages;
> > + usable_free = free_pages + migrc_pending_nr_in_zone(z);
> > reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
> >
> > /* reserved may over estimate high-atomic reserves. */
> > @@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > gfp_mask)) {
> > int ret;
> >
> > + /*
> > + * Free the pending folios so that the remaining
> > + * code can use them. Check zone_watermark_fast()
> > + * again with more accurate stats before going.
> > + */
> > + migrc_try_flush_free_folios(ac->nodemask);
> > + if (zone_watermark_fast(zone, order, mark,
> > + ac->highest_zoneidx, alloc_flags, gfp_mask))
> > + goto try_this_zone;
> > +
> > if (has_unaccepted_memory()) {
> > if (try_to_accept_memory(zone, order))
> > goto try_this_zone;
> > @@ -3911,6 +3926,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > unsigned int cpuset_mems_cookie;
> > unsigned int zonelist_iter_cookie;
> > int reserve_flags;
> > + bool migrc_stopped = false;
> >
> > restart:
> > compaction_retries = 0;
> > @@ -4042,6 +4058,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > if (page)
> > goto got_pg;
> >
> > + /*
> > + * The system is in very high memory pressure. Stop migrc from
> > + * expanding its pending queue and working temporarily.
> > + */
> > + if (!migrc_stopped) {
> > + migrc_stop_pending();
> > + migrc_stopped = true;
> > + }
> > +
> > /* Caller is not willing to reclaim, we can't balance anything */
> > if (!can_direct_reclaim)
> > goto nopage;
> > @@ -4169,6 +4194,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > warn_alloc(gfp_mask, ac->nodemask,
> > "page allocation failure: order:%u", order);
> > got_pg:
> > + if (migrc_stopped)
> > + migrc_resume_pending();
> > return page;
> > }
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index a045f3b57c60..d0fa7bf66e70 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -606,6 +606,82 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
> >
> > #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> >
> > +#ifdef CONFIG_MIGRC
> > +/*
> > + * Gather folios and CPUs in cpumask to handle.
> > + */
> > +static void migrc_gather(struct llist_head *folios,
> > + struct arch_tlbflush_unmap_batch *arch,
> > + struct llist_head *reqs)
> > +{
> > + struct llist_node *reqs_nodes;
> > + struct migrc_req *req;
> > + struct migrc_req *req2;
> > +
> > + reqs_nodes = llist_del_all(reqs);
> > + if (!reqs_nodes)
> > + return;
> > +
> > + /*
> > + * TODO: Optimize the time complexity.
> > + */
>
> Please mark the patch as an RFC until no more TODOs.

I removed the RFC tag because those are long-term TODOs that are not
essential.

> > + llist_for_each_entry_safe(req, req2, reqs_nodes, llnode) {
> > + struct llist_node *n;
> > +
> > + arch_migrc_adj(&req->arch, req->gen);
> > + arch_tlbbatch_fold(arch, &req->arch);
> > +
> > + n = llist_del_all(&req->folios);
> > + llist_add_batch(n, req->last, folios);
> > + free_migrc_req(req);
> > + }
> > +}
> > +
> > +bool migrc_try_flush_free_folios(nodemask_t *nodes)
> > +{
> > + struct arch_tlbflush_unmap_batch arch;
> > + LLIST_HEAD(folios);
>
> Again, I don’t understand why llist is needed.

Yep. I will change it to list_head. Thanks!
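
For context, a minimal hypothetical sketch of the distinction being
discussed: llist is meant for lock-free concurrent producers, while a list
that is built and walked by a single context can simply be a list_head. The
struct and function names below are illustrative only, not from the patch:

        #include <linux/list.h>
        #include <linux/llist.h>

        struct pending_req {
                struct llist_node llnode;  /* lock-free add from any CPU */
                struct list_head  local;   /* used only by the drainer */
        };

        /* Producer side: many contexts may add concurrently, so llist fits. */
        static void queue_req(struct llist_head *shared, struct pending_req *req)
        {
                llist_add(&req->llnode, shared);
        }

        /*
         * Consumer side: once llist_del_all() has detached the nodes, they
         * are private to this context, so plain list_head bookkeeping is
         * enough.
         */
        static void drain_reqs(struct llist_head *shared, struct list_head *out)
        {
                struct pending_req *req, *tmp;
                struct llist_node *first = llist_del_all(shared);

                llist_for_each_entry_safe(req, tmp, first, llnode)
                        list_add_tail(&req->local, out);
        }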

Appreciate all your great comments.

Byungchul

2023-10-30 15:59:06

by Nadav Amit

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration



> On Oct 30, 2023, at 2:51 PM, Byungchul Park <[email protected]> wrote:
>
> I really spent a lot of time hesitating whether splitting it or not.
> However, the test result cannot be stable without those. I'm still
> confused. I think the test result should be stable at each commit,
> right?

Of course. You can extract the optimization we mentioned, and perhaps
have more preparatory patches.

Just a couple of comments that may also help breaking the patches:

1. The “stopping” logic is a bit not great. Try to see if you can
somehow use shrinker or OOM infrastructure instead.

2. Regarding “overflows”, it’s not always a question of whether an
overflow would happen naturally, but whether a malicious process can
trigger it.

Regards,
Nadav

2023-10-30 17:56:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [v3 0/3] Reduce TLB flushes under some specific conditions

On 10/30/23 00:25, Byungchul Park wrote:
> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> destination of folios participated in the migrations until all TLB
> flushes required are done, only if those folios are not mapped with
> write permission PTE entries at all. I worked Based on v6.6-rc5.

There's a lot of common overhead here, on top of the complexity in general:

* A new page flag
* A new cpumask_t in task_struct
* A new zone list
* Extra (temporary) memory consumption

and the benefits are ... "performance improved a little bit" on one
workload. That doesn't seem like a good overall tradeoff to me.

There will certainly be workloads that, before this patch, would have
little or no memory pressure and after this patch would need to do reclaim.

Also, looking with my arch/x86 hat on, there's really nothing
arch-specific here. Please try to keep stuff out of arch/x86 unless
it's very much arch-specific.

The connection between the arch-generic TLB flushing and
__flush_tlb_local() seems quite tenuous. __flush_tlb_local() is, to me,
quite deep in the implementation and there are quite a few ways that a
random TLB flush might not end up there. In other words, I'm not saying
that this is broken, but it's not clear at all to me how it functions
reliably.

2023-10-30 18:33:02

by Nadav Amit

[permalink] [raw]
Subject: Re: [v3 0/3] Reduce TLB flushes under some specific conditions


> On Oct 30, 2023, at 7:55 PM, Dave Hansen <[email protected]> wrote:
>
> !! External Email
>
> On 10/30/23 00:25, Byungchul Park wrote:
>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>> destination of folios participated in the migrations until all TLB
>> flushes required are done, only if those folios are not mapped with
>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.

I almost forgot that I did (and embarrassingly did not follow up on) a
TLB flush deferring mechanism before [*], which was relatively generic.
I did not look at the migration case, but it could have been added
relatively easily, I think.

Feel free to plagiarize if you find it suitable. Note that some of
the patch-set is not relevant (e.g., 20/20 has already been fixed,
3/20 was merged.)

[*] https://lore.kernel.org/linux-mm/[email protected]/

2023-10-30 22:41:46

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

On Mon, Oct 30, 2023 at 03:58:44PM +0000, Nadav Amit wrote:
>
>
> > On Oct 30, 2023, at 2:51 PM, Byungchul Park <[email protected]> wrote:
> >
> > I really spent a lot of time hesitating whether splitting it or not.
> > However, the test result cannot be stable without those. I'm still
> > confused. I think the test result should be stable at each commit,
> > right?
>
> Of course. You can extract the optimization we mentioned, and perhaps
> have more preparatory patches.
>
> Just a couple of comments that may also help breaking the patches:
>
> 1. The “stopping” logic is a bit not great. Try to see if you can
> somehow use shrinker or OOM infrastructure instead.

The "stopping" here means temporarily pausing the expansion of migrc's
pending queue, not shrinking folios. The naming was my fault. I will
rename it so that it cannot be misread.
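
To make the reviewer's suggestion above concrete, a rough, hypothetical
sketch of hooking migrc's pending queue into the shrinker infrastructure
rather than using a stop/resume switch; migrc_pending_nr() is an assumed
helper summing the per-zone pending counters, migrc_try_flush_free_folios()
is the flush routine from the patch, and everything else here is
illustrative only:

        #include <linux/shrinker.h>

        static unsigned long migrc_count_objects(struct shrinker *s,
                                                 struct shrink_control *sc)
        {
                unsigned long nr = migrc_pending_nr();  /* assumed helper */

                return nr ? nr : SHRINK_EMPTY;
        }

        static unsigned long migrc_scan_objects(struct shrinker *s,
                                                struct shrink_control *sc)
        {
                /*
                 * Flushing releases every pending folio at once; report the
                 * requested amount as freed, a coarse approximation that is
                 * good enough for this sketch.
                 */
                return migrc_try_flush_free_folios(NULL) ?
                                sc->nr_to_scan : SHRINK_STOP;
        }

        static struct shrinker migrc_shrinker = {
                .count_objects  = migrc_count_objects,
                .scan_objects   = migrc_scan_objects,
                .seeks          = DEFAULT_SEEKS,
        };

        /* register_shrinker(&migrc_shrinker, "migrc") from an init path */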

> 2. Regarding “overflows”, it’s not always a question of whether an
> overflow would happen naturally, but whether a malicious process can
> trigger it.

I understand the concern. However, the variable is intended to wrap
around, and the wraparound is harmless as long as it is expected and
every comparison is written to be wrap-safe. See time_after() in
include/linux/jiffies.h for the same pattern; a minimal sketch follows
below.
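
For reference, the core of the wrap-safe comparison idiom being referred
to; time_after() below is quoted from include/linux/jiffies.h, while
migrc_gen_after() is a hypothetical counterpart for migrc's generation
counter, not part of the posted patch:

        /* include/linux/jiffies.h */
        #define time_after(a,b)         \
                (typecheck(unsigned long, a) && \
                 typecheck(unsigned long, b) && \
                 ((long)((b) - (a)) < 0))

        /*
         * Hypothetical migrc counterpart: as long as the two generations
         * compared are never more than half the counter range apart, the
         * signed subtraction gives the right answer even across a
         * wraparound.
         */
        static inline bool migrc_gen_after(unsigned int a, unsigned int b)
        {
                return (int)(b - a) < 0;
        }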

Byungchul
>
> Regards,
> Nadav
>

2023-10-30 23:02:29

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 0/3] Reduce TLB flushes under some specific conditions

On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.
>
> There will certainly be workloads that, before this patch, would have
> little or no memory pressure and after this patch would need to do reclaim.

Whether '(gain - cost) > 0' holds is a difficult question. I think the
following are already big benefits in general:

1. big reduction of IPIs #
2. big reduction of TLB flushes #
3. big reduction of TLB misses #

Of course, I or we need to keep trying to see a better number in
end-to-end performance.

> Also, looking with my arch/x86 hat on, there's really nothing
> arch-specific here. Please try to keep stuff out of arch/x86 unless
> it's very much arch-specific.

Okay. I will try to keep it out of arch code. I will have to give up an
optimization that can be achieved by working in arch code, though.

Byungchul

2023-10-31 02:37:54

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 0/3] Reduce TLB flushes under some specific conditions

On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.

I tested it under limited conditions to get stable results, e.g. with
hyper-threading disabled, CPU time dedicated to the test, and so on.
However, I'm convinced that this patch set is more worth developing than
you think it is. Let me share the results I've just got after changing
the # of CPUs participating in the test from 16 to 80, on a system with
80 CPUs. This is just for your information - not that stable though.

Byungchul

---

Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (80 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

mainline kernel
===============

The 1st try)
=====================================
Threads: 64
Runtime: 233.118 seconds
=====================================
numa_pages_migrated 758334
pgmigrate_success 1724964
nr_tlb_remote_flush 305706
nr_tlb_remote_flush_received 18598543
nr_tlb_local_flush_all 19092
nr_tlb_local_flush_one 4518717

The 2nd try)
=====================================
Threads: 64
Runtime: 221.725 seconds
=====================================
numa_pages_migrated 633209
pgmigrate_success 2156509
nr_tlb_remote_flush 261977
nr_tlb_remote_flush_received 14289256
nr_tlb_local_flush_all 11738
nr_tlb_local_flush_one 4520317

mainline kernel + migrc
=======================

The 1st try)
=====================================
Threads: 64
Runtime: 212.522 seconds
====================================
numa_pages_migrated 901264
pgmigrate_success 1990814
nr_tlb_remote_flush 151280
nr_tlb_remote_flush_received 9031376
nr_tlb_local_flush_all 21208
nr_tlb_local_flush_one 4519595

The 2nd try)
=====================================
Threads: 64
Runtime: 204.410 seconds
====================================
numa_pages_migrated 929260
pgmigrate_success 2729868
nr_tlb_remote_flush 166722
nr_tlb_remote_flush_received 8238273
nr_tlb_local_flush_all 13717
nr_tlb_local_flush_one 4519582

2023-10-31 08:48:26

by David Hildenbrand

[permalink] [raw]
Subject: Re: [v3 0/3] Reduce TLB flushes under some specific conditions

On 30.10.23 23:55, Byungchul Park wrote:
> On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
>> On 10/30/23 00:25, Byungchul Park wrote:
>>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>>> destination of folios participated in the migrations until all TLB
>>> flushes required are done, only if those folios are not mapped with
>>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>>
>> There's a lot of common overhead here, on top of the complexity in general:
>>
>> * A new page flag
>> * A new cpumask_t in task_struct
>> * A new zone list
>> * Extra (temporary) memory consumption
>>
>> and the benefits are ... "performance improved a little bit" on one
>> workload. That doesn't seem like a good overall tradeoff to me.
>>
>> There will certainly be workloads that, before this patch, would have
>> little or no memory pressure and after this patch would need to do reclaim.
>
> Whether '(gain - cost) > 0' holds is a difficult question. I think the
> following are already big benefits in general:
>
> 1. big reduction of IPIs #
> 2. big reduction of TLB flushes #
> 3. big reduction of TLB misses #
>
> Of course, I or we need to keep trying to see a better number in
> end-to-end performance.

You'll have to show convincing, real numbers, for use cases people care
about, to even motivate why people should consider looking at this in
more detail.

If you can't measure it and only speculate, nobody cares.

The numbers you provided were so far not convincing, and it's
questionable if the single benchmark you are presenting represents a
reasonable real workload that ends up improving *real* workloads. A
better description of the whole benchmark and why it represents a real
workload behavior might help.

--
Cheers,

David / dhildenb

2023-11-01 03:10:14

by Huang, Ying

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

Byungchul Park <[email protected]> writes:

> On Mon, Oct 30, 2023 at 09:00:56AM +0100, David Hildenbrand wrote:
>> On 30.10.23 08:25, Byungchul Park wrote:
>> > Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
>> > We always face the migration overhead at either promotion or demotion,
>> > while working with tiered memory e.g. CXL memory and found out TLB
>> > shootdown is a quite big one that is needed to get rid of if possible.
>> >
>> > Fortunately, TLB flush can be defered or even skipped if both source and
>> > destination of folios during migration are kept until all TLB flushes
>> > required will have been done, of course, only if the target PTE entries
>> > have read only permission, more precisely speaking, don't have write
>> > permission. Otherwise, no doubt the folio might get messed up.
>> >
>> > To achieve that:
>> >
>> > 1. For the folios that map only to non-writable TLB entries, prevent
>> > TLB flush at migration by keeping both source and destination
>> > folios, which will be handled later at a better time.
>> >
>> > 2. When any non-writable TLB entry changes to writable e.g. through
>> > fault handler, give up CONFIG_MIGRC mechanism so as to perform
>> > TLB flush required right away.
>> >
>> > 3. Temporarily stop migrc from working when the system is in very
>> > high memory pressure e.g. direct reclaim needed.
>> >
>> > The measurement result:
>> >
>> > Architecture - x86_64
>> > QEMU - kvm enabled, host cpu
>> > Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
>> > Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
>> > Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
>> >
>> > run 'perf stat' using events:
>> > 1) itlb.itlb_flush
>> > 2) tlb_flush.dtlb_thread
>> > 3) tlb_flush.stlb_any
>> > 4) dTLB-load-misses
>> > 5) dTLB-store-misses
>> > 6) iTLB-load-misses
>> >
>> > run 'cat /proc/vmstat' and pick:
>> > 1) numa_pages_migrated
>> > 2) pgmigrate_success
>> > 3) nr_tlb_remote_flush
>> > 4) nr_tlb_remote_flush_received
>> > 5) nr_tlb_local_flush_all
>> > 6) nr_tlb_local_flush_one
>> >
>> > BEFORE - mainline v6.6-rc5
>> > ------------------------------------------
>> > $ perf stat -a \
>> > -e itlb.itlb_flush \
>> > -e tlb_flush.dtlb_thread \
>> > -e tlb_flush.stlb_any \
>> > -e dTLB-load-misses \
>> > -e dTLB-store-misses \
>> > -e iTLB-load-misses \
>> > ./XSBench -p 50000000
>> >
>> > Performance counter stats for 'system wide':
>> >
>> > 20953405 itlb.itlb_flush
>> > 114886593 tlb_flush.dtlb_thread
>> > 88267015 tlb_flush.stlb_any
>> > 115304095543 dTLB-load-misses
>> > 163904743 dTLB-store-misses
>> > 608486259 iTLB-load-misses
>> >
>> > 556.787113849 seconds time elapsed
>> >
>> > $ cat /proc/vmstat
>> >
>> > ...
>> > numa_pages_migrated 3378748
>> > pgmigrate_success 7720310
>> > nr_tlb_remote_flush 751464
>> > nr_tlb_remote_flush_received 10742115
>> > nr_tlb_local_flush_all 21899
>> > nr_tlb_local_flush_one 740157
>> > ...
>> >
>> > AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
>> > ------------------------------------------
>> > $ perf stat -a \
>> > -e itlb.itlb_flush \
>> > -e tlb_flush.dtlb_thread \
>> > -e tlb_flush.stlb_any \
>> > -e dTLB-load-misses \
>> > -e dTLB-store-misses \
>> > -e iTLB-load-misses \
>> > ./XSBench -p 50000000
>> >
>> > Performance counter stats for 'system wide':
>> >
>> > 4353555 itlb.itlb_flush
>> > 72482780 tlb_flush.dtlb_thread
>> > 68226458 tlb_flush.stlb_any
>> > 114331610808 dTLB-load-misses
>> > 116084771 dTLB-store-misses
>> > 377180518 iTLB-load-misses
>> >
>> > 552.667718220 seconds time elapsed
>> >
>> > $ cat /proc/vmstat
>> >
>>
>> So, an improvement of 0.74% ? How stable are the results? Serious question:
>
> I'm getting very stable result.
>
>> worth the churn?
>
> Yes, ultimately the time wise improvement should be observed. However,
> I've been focusing on the numbers of TLB flushes and TLB misses because
> better result in terms of total time will be followed depending on the
> test condition. We can see the result if we test with a system that:
>
> 1. has more CPUs that would induce a crazy number of IPIs.

FYI, the number of TLB flushing IPIs is reduced considerably by commit
7e12beb8ca2a ("migrate_pages: batch flushing TLB") when multiple pages
are migrated together.

--
Best Regards,
Huang, Ying

> 2. has slow memories that makes TLB miss overhead bigger.
> 3. runs workloads that is harmful at TLB miss and IPI storm.
> 4. runs workloads that causes heavier numa migrations.
> 5. runs workloads that has a lot of read only permission mappings.
> 6. and so on.
>
> I will share the results once I manage to meet the conditions.
>
> By the way, I should've added IPI reduction because it also has super
> big delta :)
>
>> Or did I get the numbers wrong?
>>
>> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
>> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> > index 5c02720c53a5..1ca2ac91aa14 100644
>> > --- a/include/linux/page-flags.h
>> > +++ b/include/linux/page-flags.h
>> > @@ -135,6 +135,9 @@ enum pageflags {
>> > #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>> > PG_arch_2,
>> > PG_arch_3,
>> > +#endif
>> > +#ifdef CONFIG_MIGRC
>> > + PG_migrc, /* Page has its copy under migrc's control */
>> > #endif
>> > __NR_PAGEFLAGS,
>> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
>> > PAGEFLAG(Idle, idle, PF_ANY)
>> > #endif
>> > +#ifdef CONFIG_MIGRC
>> > +PAGEFLAG(Migrc, migrc, PF_ANY)
>> > +#endif
>>
>> I assume you know this: new pageflags are frowned upon.
>
> Sorry for that. I really didn't want to add a new headache.
>
> Byungchul

2023-11-15 05:49:10

by Byungchul Park

[permalink] [raw]
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

On Fri, Nov 10, 2023 at 10:18:06PM +0000, Nadav Amit wrote:
> > On Nov 10, 2023, at 5:13 AM, Byungchul Park <[email protected]> wrote:
> >
> >
> >> Here,
> >>
> >> mprotect()
> >> do_mprotect_pkey()
> >> tlb_finish_mmu()
> >> tlb_flush_mmu()
> >>
> >> I thought TLB flush for mprotect() is performed by tlb_flush_mmu() so
> >> any cached TLB entries on other CPUs can have chance to update. Could
> >> you correct me if I get it wrong? Thanks.
> >
> > I guess you tried to inform me that x86 mmu automatically keeps the
> > consistancy based on cached TLB entries. Right? If yes, I should do
> > something on that path. If not, it's not problematic. Thoughts?
>
> Perhaps I lost something in this whole scheme. Overall, I find it overly
> complicated and somewhat hard to follow.
>
> Ideally, a solution should reduce the number of TLB flushing mechanisms
> and not introduce yet another one that would further increase the already
> high complexity.
>
> Anyhow, I a bit convoluted 2 scenarios, and I’m not sure whether they
> are a potential problem.
>
> (1) Assume there is a RO page migration and you skipped a TLB flush.
> Then you set a RO PTE entry for that page. Afterwards, you have mprotect()
> that updates the PTE for that page to be RW.
>
> Now, tlb_finish_mmu() will do a TLB flush eventually in the mprotect()
> flow, but until it is done, you might have one CPU have RO pointing to
> the source page (no TLB flush, right?) and another having RW access

Is it possible for another CPU to gain RW access even before mprotect()
is done?

> that were loaded from the updated PTE. Did I miss a TLB flush that
> should take place beforehand?
>
>
> (2) Indeed we encountered many problems when TLB flushing decisions
> are based on PTEs that are read from the page-tables and those do
> not reflect the values that are held in the TLBs.

Do you have an example that would help me understand what you mean?

> Overall, I think that a solution is to consolidate the TLB batching
> mechanisms and hold per a group of PTEs (e.g., 512 PTEs held in one
> page-table) the deferred TLB flush generation. Track the “done” TLB
> flush generation, if the deferred is greater than the “done”, perform
> a TLB flush.
>
> This scheme is a bit similar to what you were doing, but more general,
> and easier to follow. I think that its value might be more in simplifying
> the code and reasoning than just performance.

Worth noting that, at the beginning, I tried to reduce TLB flushes by:

1. deferring and performing TLB flushes needed in a batch,
2. skipping some TLB flushes that have already been done.

However, I gave up the 2nd optimization to keep only
architecture-independent code in the patch set. So the current migrc
version cannot skip TLB flushes; it only defers and batches them until a
flush becomes inevitable.

To cut a long story short, migrc lets all other TLB flushes be performed
as is, *except* for one special case, i.e. RO-mapped folios, during
migration.

Byungchul
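
As an editorial aside, a minimal, hypothetical sketch of the per-page-table
generation tracking Nadav describes above; none of these names exist in the
posted patches, flush_tlb_and_advance_done_gen() is an assumed helper, and
the locking a real implementation would need is omitted:

        #include <linux/atomic.h>
        #include <linux/types.h>

        extern void flush_tlb_and_advance_done_gen(void);  /* assumed */

        /* Last TLB flush generation known to be completed system-wide. */
        static atomic64_t tlb_flush_done_gen;

        /* Hypothetical per-page-table bookkeeping (e.g. one per 512 PTEs). */
        struct pt_flush_state {
                u64 deferred_gen;       /* newest generation still pending */
        };

        /* Called when a TLB flush for @gen is deferred for PTEs under @pt. */
        static void pt_defer_flush(struct pt_flush_state *pt, u64 gen)
        {
                if (gen > pt->deferred_gen)
                        pt->deferred_gen = gen;
        }

        /*
         * Called before trusting the PTEs again, e.g. when upgrading a
         * read-only mapping to writable: flush if the deferred generation
         * has not been covered by a completed flush yet.
         */
        static void pt_flush_if_deferred(struct pt_flush_state *pt)
        {
                if (pt->deferred_gen > (u64)atomic64_read(&tlb_flush_done_gen))
                        flush_tlb_and_advance_done_gen();
        }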