2023-10-10 08:32:22

by Mel Gorman

Subject: [PATCH 0/6] sched/numa: Complete scanning of partial and inactive VMAs

NUMA Balancing currently uses PID fault activity within a VMA to
determine if it is worth updating PTEs to trap NUMA hinting faults.
While this reduces overhead, it misses two important corner cases.
The first is that if Task A partially scans a VMA that is active and
Task B resumes the scan but is inactive, then the remainder of the VMA
may be missed. Similarly, if a VMA is inactive for a period of time then
it may never be scanned again.
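
For reference, the filter in question is the per-VMA PID activity test,
reproduced (lightly condensed) from kernel/sched/fair.c at this series'
baseline; the field names predate the rename in patch 2:

  /*
   * A VMA is scanned only if the current task's hashed PID bit is set
   * in one of the VMA's two PID tracking windows.
   */
  static bool vma_is_accessed(struct vm_area_struct *vma)
  {
          unsigned long pids;

          /* Scan unconditionally during the first two scan sequences. */
          if (READ_ONCE(current->mm->numa_scan_seq) < 2)
                  return true;

          pids = vma->numab_state->access_pids[0] |
                 vma->numab_state->access_pids[1];
          return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)),
                          &pids);
  }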

Patches 1-3 improve the documentation of the current per-VMA tracking
and add a trace point for scan activity. Patch 4 addresses a corner
case where the PID activity information may not be reset after the
expected timeout. Patches 5-6 complete the scanning of partial and
inactive VMAs within the scan sequence.

This could be improved further but it would deserve a separate series on
top with supporting data justifying the change. Otherwise any gain/loss
due to the additional changes could be masked by this series on its own.

include/linux/mm.h | 4 +-
include/linux/mm_types.h | 36 +++++++++-
include/linux/sched/numa_balancing.h | 10 +++
include/trace/events/sched.h | 52 ++++++++++++++
kernel/sched/fair.c | 103 ++++++++++++++++++++++-----
5 files changed, 182 insertions(+), 23 deletions(-)

--
2.35.3


2023-10-10 08:32:43

by Mel Gorman

Subject: [PATCH 1/6] sched/numa: Document vma_numab_state fields

Document the intended usage of the fields.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43999e6..0fe054afc4d6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -551,9 +551,33 @@ struct vma_lock {
};

struct vma_numab_state {
- unsigned long next_scan;
- unsigned long next_pid_reset;
- unsigned long access_pids[2];
+ unsigned long next_scan; /* Initialised as time in
+ * jiffies after which VMA
+ * should be scanned. Delays
+ * first scan of new VMA by at
+ * least
+ * sysctl_numa_balancing_scan_delay
+ */
+ unsigned long next_pid_reset; /* Time in jiffies when
+ * access_pids is reset to
+ * detect phase change
+ * behaviour.
+ */
+ unsigned long access_pids[2]; /* Approximate tracking of PIDS
+ * that trapped a NUMA hinting
+ * fault. May produce false
+ * positives due to hash
+ * collisions.
+ *
+ * [0] Previous PID tracking
+ * [1] Current PID tracking
+ *
+ * Window moves after
+ * next_pid_reset has expired
+ * approximately every
+ * VMA_PID_RESET_PERIOD
+ * jiffies.
+ */
};

/*
--
2.35.3

2023-10-10 08:32:46

by Mel Gorman

Subject: [PATCH 4/6] sched/numa: Move up the access pid reset logic

From: Raghavendra K T <[email protected]>

Recent NUMA hinting faulting activity is reset approximately every
VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not
accessed a VMA then the reset check is missed and the reset is potentially
deferred forever. Check if the PID activity information should be reset
before checking if the current task recently trapped a NUMA hinting fault.

[[email protected]: Rewrite changelog]
Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0535c57f6a77..05e89a7950d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3277,16 +3277,7 @@ static void task_numa_work(struct callback_head *work)
continue;
}

- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma)) {
- trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
- continue;
- }
-
- /*
- * RESET access PIDs regularly for old VMAs. Resetting after checking
- * vma for recent access to avoid clearing PID info before access..
- */
+ /* RESET access PIDs regularly for old VMAs. */
if (mm->numa_scan_seq &&
time_after(jiffies, vma->numab_state->pids_active_reset)) {
vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
@@ -3295,6 +3286,12 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->pids_active[1] = 0;
}

+ /* Do not scan the VMA if task has not accessed */
+ if (!vma_is_accessed(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
+ continue;
+ }
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
2.35.3

2023-10-10 08:32:57

by Mel Gorman

Subject: [PATCH 3/6] sched/numa: Trace decisions related to skipping VMAs

NUMA Balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched/numa_balancing.h | 8 +++++
include/trace/events/sched.h | 50 ++++++++++++++++++++++++++++
kernel/sched/fair.c | 17 +++++++---
3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 3988762efe15..c127a1509e2f 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -15,6 +15,14 @@
#define TNF_FAULT_LOCAL 0x08
#define TNF_MIGRATE_FAIL 0x10

+enum numa_vmaskip_reason {
+ NUMAB_SKIP_UNSUITABLE,
+ NUMAB_SKIP_SHARED_RO,
+ NUMAB_SKIP_INACCESSIBLE,
+ NUMAB_SKIP_SCAN_DELAY,
+ NUMAB_SKIP_PID_INACTIVE,
+};
+
#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..b0d0dbf491ea 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -664,6 +664,56 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
);

+#ifdef CONFIG_NUMA_BALANCING
+#define NUMAB_SKIP_REASON \
+ EM( NUMAB_SKIP_UNSUITABLE, "unsuitable" ) \
+ EM( NUMAB_SKIP_SHARED_RO, "shared_ro" ) \
+ EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
+ EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
+ EMe(NUMAB_SKIP_PID_INACTIVE, "pid_inactive" )
+
+/* Redefine for export. */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+NUMAB_SKIP_REASON
+
+/* Redefine for symbolic printing. */
+#undef EM
+#undef EMe
+#define EM(a, b) { a, b },
+#define EMe(a, b) { a, b }
+
+TRACE_EVENT(sched_skip_vma_numa,
+
+ TP_PROTO(struct mm_struct *mm, struct vm_area_struct *vma,
+ enum numa_vmaskip_reason reason),
+
+ TP_ARGS(mm, vma, reason),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, numa_scan_offset)
+ __field(unsigned long, vm_start)
+ __field(unsigned long, vm_end)
+ __field(enum numa_vmaskip_reason, reason)
+ ),
+
+ TP_fast_assign(
+ __entry->numa_scan_offset = mm->numa_scan_offset;
+ __entry->vm_start = vma->vm_start;
+ __entry->vm_end = vma->vm_end;
+ __entry->reason = reason;
+ ),
+
+ TP_printk("numa_scan_offset=%lX vm_start=%lX vm_end=%lX reason=%s",
+ __entry->numa_scan_offset,
+ __entry->vm_start,
+ __entry->vm_end,
+ __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
+);
+#endif /* CONFIG_NUMA_BALANCING */

/*
* Tracepoint for waking a polling cpu without an IPI.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81405627b9ed..0535c57f6a77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3227,6 +3227,7 @@ static void task_numa_work(struct callback_head *work)
do {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
continue;
}

@@ -3237,15 +3238,19 @@ static void task_numa_work(struct callback_head *work)
* as migrating the pages will be of marginal benefit.
*/
if (!vma->vm_mm ||
- (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
continue;
+ }

/*
* Skip inaccessible VMAs to avoid any confusion between
* PROT_NONE and NUMA hinting ptes
*/
- if (!vma_is_accessible(vma))
+ if (!vma_is_accessible(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE);
continue;
+ }

/* Initialise new per-VMA NUMAB state. */
if (!vma->numab_state) {
@@ -3267,12 +3272,16 @@ static void task_numa_work(struct callback_head *work)
* delay the scan for new VMAs.
*/
if (mm->numa_scan_seq && time_before(jiffies,
- vma->numab_state->next_scan))
+ vma->numab_state->next_scan)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SCAN_DELAY);
continue;
+ }

/* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma))
+ if (!vma_is_accessed(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
+ }

/*
* RESET access PIDs regularly for old VMAs. Resetting after checking
--
2.35.3

2023-10-10 08:33:03

by Mel Gorman

Subject: [PATCH 5/6] sched/numa: Complete scanning of partial VMAs regardless of PID activity

NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.

Force tasks to update PTEs for partially scanned VMAs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.

The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched/numa_balancing.h | 1 +
include/trace/events/sched.h | 3 ++-
kernel/sched/fair.c | 18 +++++++++++++++---
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index c127a1509e2f..7dcc0bdfddbb 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -21,6 +21,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_INACCESSIBLE,
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
+ NUMAB_SKIP_IGNORE_PID,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index b0d0dbf491ea..27b51c81b106 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -670,7 +670,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_SHARED_RO, "shared_ro" ) \
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
- EMe(NUMAB_SKIP_PID_INACTIVE, "pid_inactive" )
+ EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
+ EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05e89a7950d0..150f01948ec6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3130,7 +3130,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}

-static bool vma_is_accessed(struct vm_area_struct *vma)
+static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
{
unsigned long pids;
/*
@@ -3143,7 +3143,19 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
return true;

pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
- return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+ if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+ return true;
+
+ /*
+ * Complete a scan that has already started regardless of PID access or
+ * some VMAs may never be scanned in multi-threaded applications
+ */
+ if (mm->numa_scan_offset > vma->vm_start) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_IGNORE_PID);
+ return true;
+ }
+
+ return false;
}

#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
@@ -3287,7 +3299,7 @@ static void task_numa_work(struct callback_head *work)
}

/* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma)) {
+ if (!vma_is_accessed(mm, vma)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}
--
2.35.3

2023-10-10 08:33:19

by Mel Gorman

Subject: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch is as
follows;

6.6.0-rc2-sched-numabtrace-v1
649 reason=scan_delay
9094 reason=unsuitable
48915 reason=shared_ro
143919 reason=inaccessible
193050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
146 reason=seq_completed
622 reason=ignore_pid_inactive
624 reason=scan_delay
6570 reason=unsuitable
16101 reason=shared_ro
27608 reason=inaccessible
41939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)

The time to complete the workload is reduced by almost 30%:

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1 /
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 6 +++
include/linux/sched/numa_balancing.h | 1 +
include/trace/events/sched.h | 3 +-
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++--
4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8cb1dec3e358..a123c1a58617 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,12 @@ struct vma_numab_state {
* VMA_PID_RESET_PERIOD
* jiffies.
*/
+ int prev_scan_seq; /* MM scan sequence ID when
+ * the VMA was last completely
+ * scanned. A VMA is not
+ * eligible for scanning if
+ * prev_scan_seq == numa_scan_seq
+ */
};

/*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bdfddbb..b69afb8630db 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
NUMAB_SKIP_IGNORE_PID,
+ NUMAB_SKIP_SEQ_COMPLETED,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 27b51c81b106..010ba1b7cb0e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
- EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )
+ EM( NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" ) \
+ EMe(NUMAB_SKIP_SEQ_COMPLETED, "seq_completed" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 150f01948ec6..72ef60f394ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
unsigned long nr_pte_updates = 0;
long pages, virtpages;
struct vma_iterator vmi;
+ bool vma_pids_skipped;
+ bool vma_pids_forced = false;

SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));

@@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;

- start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
virtpages = pages * 8; /* Scan up to this much virtual space */
@@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)

if (!mmap_read_trylock(mm))
return;
+
+ /*
+ * VMAs are skipped if the current PID has not trapped a fault within
+ * the VMA recently. Allow scanning to be forced if there is no
+ * suitable VMA remaining.
+ */
+ vma_pids_skipped = false;
+
+retry_pids:
+ start = mm->numa_scan_offset;
vma_iter_init(&vmi, mm, start);
vma = vma_next(&vmi);
if (!vma) {
@@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
/* Reset happens after 4 times scan delay of scan start */
vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ /*
+ * Ensure prev_scan_seq does not match numa_scan_seq
+ * to prevent VMAs being skipped prematurely on the
+ * first scan.
+ */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
}

/*
@@ -3298,8 +3316,19 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->pids_active[1] = 0;
}

- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(mm, vma)) {
+ /* Do not rescan VMAs twice within the same sequence. */
+ if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+ mm->numa_scan_offset = vma->vm_end;
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+ continue;
+ }
+
+ /*
+ * Do not scan the VMA if task has not accessed unless no other
+ * VMA candidate exists.
+ */
+ if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+ vma_pids_skipped = true;
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}
@@ -3328,8 +3357,28 @@ static void task_numa_work(struct callback_head *work)

cond_resched();
} while (end != vma->vm_end);
+
+ /* VMA scan is complete, do not scan until next sequence. */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+ /*
+ * Only force scan within one VMA at a time to limit the
+ * cost of scanning a potentially uninteresting VMA.
+ */
+ if (vma_pids_forced)
+ break;
} for_each_vma(vmi, vma);

+ /*
+ * If no VMAs are remaining and VMAs were skipped due to the PID
+ * not accessing the VMA previously then force a scan to ensure
+ * forward progress.
+ */
+ if (!vma && !vma_pids_forced && vma_pids_skipped) {
+ vma_pids_forced = true;
+ goto retry_pids;
+ }
+
out:
/*
* It is possible to reach the end of the VMA list but the last few
--
2.35.3

2023-10-10 08:41:57

by Mel Gorman

Subject: [PATCH 2/6] sched/numa: Rename vma_numab_state.access_pids

The access_pids field name is somewhat ambiguous as no PIDs are accessed.
Similarly, it's not clear that next_pid_reset is related to access_pids.
Rename the fields to more accurately reflect their purpose.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 4 ++--
include/linux/mm_types.h | 4 ++--
kernel/sched/fair.c | 12 ++++++------
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..19fc73b02c9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1726,8 +1726,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
unsigned int pid_bit;

pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
- if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
- __set_bit(pid_bit, &vma->numab_state->access_pids[1]);
+ if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
+ __set_bit(pid_bit, &vma->numab_state->pids_active[1]);
}
}
#else /* !CONFIG_NUMA_BALANCING */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0fe054afc4d6..8cb1dec3e358 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -558,12 +558,12 @@ struct vma_numab_state {
* least
* sysctl_numa_balancing_scan_delay
*/
- unsigned long next_pid_reset; /* Time in jiffies when
+ unsigned long pids_active_reset; /* Time in jiffies when
* access_pids is reset to
* detect phase change
* behaviour.
*/
- unsigned long access_pids[2]; /* Approximate tracking of PIDS
+ unsigned long pids_active[2]; /* Approximate tracking of PIDS
* that trapped a NUMA hinting
* fault. May produce false
* positives due to hash
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb225921bbca..81405627b9ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3142,7 +3142,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
if (READ_ONCE(current->mm->numa_scan_seq) < 2)
return true;

- pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+ pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
}

@@ -3258,7 +3258,7 @@ static void task_numa_work(struct callback_head *work)
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

/* Reset happens after 4 times scan delay of scan start */
- vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
+ vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
}

@@ -3279,11 +3279,11 @@ static void task_numa_work(struct callback_head *work)
* vma for recent access to avoid clearing PID info before access..
*/
if (mm->numa_scan_seq &&
- time_after(jiffies, vma->numab_state->next_pid_reset)) {
- vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
+ time_after(jiffies, vma->numab_state->pids_active_reset)) {
+ vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
- vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
- vma->numab_state->access_pids[1] = 0;
+ vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
+ vma->numab_state->pids_active[1] = 0;
}

do {
--
2.35.3

2023-10-10 09:24:36

by Ingo Molnar

Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative


* Mel Gorman <[email protected]> wrote:

> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
> Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
> Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
> CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
> Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
>
> The time to complete the workload is reduced by almost 30%
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1 /
> Duration User 91201.80 63506.64
> Duration System 2015.53 1819.78
> Duration Elapsed 1234.77 868.37
>
> In this specific case, system CPU time was not increased but it's not
> universally true.
>
> From vmstat, the NUMA scanning and fault activity is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Ops NUMA base-page range updates 64272.00 26374386.00
> Ops NUMA PTE updates 36624.00 55538.00
> Ops NUMA PMD updates 54.00 51404.00
> Ops NUMA hint faults 15504.00 75786.00
> Ops NUMA hint local faults % 14860.00 56763.00
> Ops NUMA hint local percent 95.85 74.90
> Ops NUMA pages migrated 1629.00 6469222.00
>
> Both the number of PTE updates and hint faults is dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.

Nice! I've applied your series to tip:sched/core with a few non-functional
edits to comment/changelog formatting/clarity.

Btw., was any previous analysis done on the size of the pids_active[] hash
and the hash collision rate?

64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
threaded workloads, and the kmalloc of numab_state likely allocates a full
cacheline anyway, so we could double the hash size from 16 bytes (2x1 longs)
to 32 bytes (2x2 longs) with very little real cost, and still have a long
field left to spare?
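
Something like this, perhaps (a hypothetical, untested sketch; the
layout and the widened hash below are illustrative only):

  /* Two windows of 2 longs each: 128 hash bits per window. */
  unsigned long pids_active[2][2];

  /* ...with the PID hash widened to match: */
  unsigned int pid_bit = hash_32(current->pid, ilog2(2 * BITS_PER_LONG));

  __set_bit(pid_bit % BITS_PER_LONG,
            &vma->numab_state->pids_active[1][pid_bit / BITS_PER_LONG]);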

Thanks,

Ingo

Subject: [tip: sched/core] sched/numa: Trace decisions related to skipping VMAs

The following commit has been merged into the sched/core branch of tip:

Commit-ID: ed2da8b725b932b1e2b2f4835bb664d47ed03031
Gitweb: https://git.kernel.org/tip/ed2da8b725b932b1e2b2f4835bb664d47ed03031
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:40 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:00 +02:00

sched/numa: Trace decisions related to skipping VMAs

NUMA balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/numa_balancing.h | 8 ++++-
include/trace/events/sched.h | 50 +++++++++++++++++++++++++++-
kernel/sched/fair.c | 17 ++++++---
3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 3988762..c127a15 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -15,6 +15,14 @@
#define TNF_FAULT_LOCAL 0x08
#define TNF_MIGRATE_FAIL 0x10

+enum numa_vmaskip_reason {
+ NUMAB_SKIP_UNSUITABLE,
+ NUMAB_SKIP_SHARED_RO,
+ NUMAB_SKIP_INACCESSIBLE,
+ NUMAB_SKIP_SCAN_DELAY,
+ NUMAB_SKIP_PID_INACTIVE,
+};
+
#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index a13d5d0..d82a04d 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -664,6 +664,56 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
);

+#ifdef CONFIG_NUMA_BALANCING
+#define NUMAB_SKIP_REASON \
+ EM( NUMAB_SKIP_UNSUITABLE, "unsuitable" ) \
+ EM( NUMAB_SKIP_SHARED_RO, "shared_ro" ) \
+ EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
+ EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
+ EMe(NUMAB_SKIP_PID_INACTIVE, "pid_inactive" )
+
+/* Redefine for export. */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+NUMAB_SKIP_REASON
+
+/* Redefine for symbolic printing. */
+#undef EM
+#undef EMe
+#define EM(a, b) { a, b },
+#define EMe(a, b) { a, b }
+
+TRACE_EVENT(sched_skip_vma_numa,
+
+ TP_PROTO(struct mm_struct *mm, struct vm_area_struct *vma,
+ enum numa_vmaskip_reason reason),
+
+ TP_ARGS(mm, vma, reason),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, numa_scan_offset)
+ __field(unsigned long, vm_start)
+ __field(unsigned long, vm_end)
+ __field(enum numa_vmaskip_reason, reason)
+ ),
+
+ TP_fast_assign(
+ __entry->numa_scan_offset = mm->numa_scan_offset;
+ __entry->vm_start = vma->vm_start;
+ __entry->vm_end = vma->vm_end;
+ __entry->reason = reason;
+ ),
+
+ TP_printk("numa_scan_offset=%lX vm_start=%lX vm_end=%lX reason=%s",
+ __entry->numa_scan_offset,
+ __entry->vm_start,
+ __entry->vm_end,
+ __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
+);
+#endif /* CONFIG_NUMA_BALANCING */

/*
* Tracepoint for waking a polling cpu without an IPI.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b47edc..31cfdb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3210,6 +3210,7 @@ static void task_numa_work(struct callback_head *work)
do {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
continue;
}

@@ -3220,15 +3221,19 @@ static void task_numa_work(struct callback_head *work)
* as migrating the pages will be of marginal benefit.
*/
if (!vma->vm_mm ||
- (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
continue;
+ }

/*
* Skip inaccessible VMAs to avoid any confusion between
* PROT_NONE and NUMA hinting ptes
*/
- if (!vma_is_accessible(vma))
+ if (!vma_is_accessible(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE);
continue;
+ }

/* Initialise new per-VMA NUMAB state. */
if (!vma->numab_state) {
@@ -3250,12 +3255,16 @@ static void task_numa_work(struct callback_head *work)
* delay the scan for new VMAs.
*/
if (mm->numa_scan_seq && time_before(jiffies,
- vma->numab_state->next_scan))
+ vma->numab_state->next_scan)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SCAN_DELAY);
continue;
+ }

/* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma))
+ if (!vma_is_accessed(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
+ }

/*
* RESET access PIDs regularly for old VMAs. Resetting after checking

Subject: [tip: sched/core] sched/numa: Complete scanning of inactive VMAs when there is no alternative

The following commit has been merged into the sched/core branch of tip:

Commit-ID: ef4e0604996f5376e2fc044b2d7b56f4a6cc697e
Gitweb: https://git.kernel.org/tip/ef4e0604996f5376e2fc044b2d7b56f4a6cc697e
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:43 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:01 +02:00

sched/numa: Complete scanning of inactive VMAs when there is no alternative

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

6.6.0-rc2-sched-numabtrace-v1
=============================
649 reason=scan_delay
9,094 reason=unsuitable
48,915 reason=shared_ro
143,919 reason=inaccessible
193,050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
=================================
146 reason=seq_completed
622 reason=ignore_pid_inactive
624 reason=scan_delay
6,570 reason=unsuitable
16,101 reason=shared_ro
27,608 reason=inaccessible
41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)

The time to complete the workload is reduced by almost 30%:

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1 /
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/mm_types.h | 6 +++-
include/linux/sched/numa_balancing.h | 1 +-
include/trace/events/sched.h | 3 +-
kernel/sched/fair.c | 55 +++++++++++++++++++++++++--
4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e7571ec..589f31e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -575,6 +575,12 @@ struct vma_numab_state {
* every VMA_PID_RESET_PERIOD jiffies:
*/
unsigned long pids_active[2];
+
+ /*
+ * MM scan sequence ID when the VMA was last completely scanned.
+ * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+ */
+ int prev_scan_seq;
};

/*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bd..b69afb8 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
NUMAB_SKIP_IGNORE_PID,
+ NUMAB_SKIP_SEQ_COMPLETED,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bfc07c1..6188ad0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
- EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )
+ EM( NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" ) \
+ EMe(NUMAB_SKIP_SEQ_COMPLETED, "seq_completed" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab79013..9229051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3158,6 +3158,8 @@ static void task_numa_work(struct callback_head *work)
unsigned long nr_pte_updates = 0;
long pages, virtpages;
struct vma_iterator vmi;
+ bool vma_pids_skipped;
+ bool vma_pids_forced = false;

SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));

@@ -3200,7 +3202,6 @@ static void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;

- start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
virtpages = pages * 8; /* Scan up to this much virtual space */
@@ -3210,6 +3211,16 @@ static void task_numa_work(struct callback_head *work)

if (!mmap_read_trylock(mm))
return;
+
+ /*
+ * VMAs are skipped if the current PID has not trapped a fault within
+ * the VMA recently. Allow scanning to be forced if there is no
+ * suitable VMA remaining.
+ */
+ vma_pids_skipped = false;
+
+retry_pids:
+ start = mm->numa_scan_offset;
vma_iter_init(&vmi, mm, start);
vma = vma_next(&vmi);
if (!vma) {
@@ -3260,6 +3271,13 @@ static void task_numa_work(struct callback_head *work)
/* Reset happens after 4 times scan delay of scan start */
vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ /*
+ * Ensure prev_scan_seq does not match numa_scan_seq,
+ * to prevent VMAs being skipped prematurely on the
+ * first scan:
+ */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
}

/*
@@ -3281,8 +3299,19 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->pids_active[1] = 0;
}

- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(mm, vma)) {
+ /* Do not rescan VMAs twice within the same sequence. */
+ if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+ mm->numa_scan_offset = vma->vm_end;
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+ continue;
+ }
+
+ /*
+ * Do not scan the VMA if task has not accessed it, unless no other
+ * VMA candidate exists.
+ */
+ if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+ vma_pids_skipped = true;
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}
@@ -3311,8 +3340,28 @@ static void task_numa_work(struct callback_head *work)

cond_resched();
} while (end != vma->vm_end);
+
+ /* VMA scan is complete, do not scan until next sequence. */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+ /*
+ * Only force scan within one VMA at a time, to limit the
+ * cost of scanning a potentially uninteresting VMA.
+ */
+ if (vma_pids_forced)
+ break;
} for_each_vma(vmi, vma);

+ /*
+ * If no VMAs are remaining and VMAs were skipped due to the PID
+ * not accessing the VMA previously, then force a scan to ensure
+ * forward progress:
+ */
+ if (!vma && !vma_pids_forced && vma_pids_skipped) {
+ vma_pids_forced = true;
+ goto retry_pids;
+ }
+
out:
/*
* It is possible to reach the end of the VMA list but the last few

Subject: [tip: sched/core] sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset

The following commit has been merged into the sched/core branch of tip:

Commit-ID: f3a6c97940fbd25d6c84c2d5642338fc99a9b35b
Gitweb: https://git.kernel.org/tip/f3a6c97940fbd25d6c84c2d5642338fc99a9b35b
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:39 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:00 +02:00

sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset

The access_pids[] field name is somewhat ambiguous as no PIDs are accessed.
Similarly, it's not clear that next_pid_reset is related to access_pids[].
Rename the fields to more accurately reflect their purpose.

[ mingo: Rename in the comments too. ]

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/mm.h | 4 ++--
include/linux/mm_types.h | 6 +++---
kernel/sched/fair.c | 12 ++++++------
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1..19fc73b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1726,8 +1726,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
unsigned int pid_bit;

pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
- if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
- __set_bit(pid_bit, &vma->numab_state->access_pids[1]);
+ if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
+ __set_bit(pid_bit, &vma->numab_state->pids_active[1]);
}
}
#else /* !CONFIG_NUMA_BALANCING */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d7f042e..e7571ec 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -559,10 +559,10 @@ struct vma_numab_state {
unsigned long next_scan;

/*
- * Time in jiffies when access_pids[] is reset to
+ * Time in jiffies when pids_active[] is reset to
* detect phase change behaviour:
*/
- unsigned long next_pid_reset;
+ unsigned long pids_active_reset;

/*
* Approximate tracking of PIDs that trapped a NUMA hinting
@@ -574,7 +574,7 @@ struct vma_numab_state {
* Window moves after next_pid_reset has expired approximately
* every VMA_PID_RESET_PERIOD jiffies:
*/
- unsigned long access_pids[2];
+ unsigned long pids_active[2];
};

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7c1baf..6b47edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3125,7 +3125,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
if (READ_ONCE(current->mm->numa_scan_seq) < 2)
return true;

- pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+ pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
}

@@ -3241,7 +3241,7 @@ static void task_numa_work(struct callback_head *work)
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

/* Reset happens after 4 times scan delay of scan start */
- vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
+ vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
}

@@ -3262,11 +3262,11 @@ static void task_numa_work(struct callback_head *work)
* vma for recent access to avoid clearing PID info before access..
*/
if (mm->numa_scan_seq &&
- time_after(jiffies, vma->numab_state->next_pid_reset)) {
- vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
+ time_after(jiffies, vma->numab_state->pids_active_reset)) {
+ vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
- vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
- vma->numab_state->access_pids[1] = 0;
+ vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
+ vma->numab_state->pids_active[1] = 0;
}

do {

Subject: [tip: sched/core] sched/numa: Complete scanning of partial VMAs regardless of PID activity

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8dd6cab1a5e96f1e4ac4969d2e5575082caab20d
Gitweb: https://git.kernel.org/tip/8dd6cab1a5e96f1e4ac4969d2e5575082caab20d
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:42 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:01 +02:00

sched/numa: Complete scanning of partial VMAs regardless of PID activity

NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.

Force tasks to update PTEs for partially scanned VMAs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.

The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/numa_balancing.h | 1 +
include/trace/events/sched.h | 3 ++-
kernel/sched/fair.c | 18 +++++++++++++++---
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index c127a15..7dcc0bd 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -21,6 +21,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_INACCESSIBLE,
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
+ NUMAB_SKIP_IGNORE_PID,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index d82a04d..bfc07c1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -670,7 +670,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_SHARED_RO, "shared_ro" ) \
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
- EMe(NUMAB_SKIP_PID_INACTIVE, "pid_inactive" )
+ EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
+ EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce36969..ab79013 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3113,7 +3113,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}

-static bool vma_is_accessed(struct vm_area_struct *vma)
+static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
{
unsigned long pids;
/*
@@ -3126,7 +3126,19 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
return true;

pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
- return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+ if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+ return true;
+
+ /*
+ * Complete a scan that has already started regardless of PID access, or
+ * some VMAs may never be scanned in multi-threaded applications:
+ */
+ if (mm->numa_scan_offset > vma->vm_start) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_IGNORE_PID);
+ return true;
+ }
+
+ return false;
}

#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
@@ -3270,7 +3282,7 @@ static void task_numa_work(struct callback_head *work)
}

/* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma)) {
+ if (!vma_is_accessed(mm, vma)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}

Subject: [tip: sched/core] sched/numa: Move up the access pid reset logic

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2e2675db1906ac04809f5399bf1f5e30d56a6f3e
Gitweb: https://git.kernel.org/tip/2e2675db1906ac04809f5399bf1f5e30d56a6f3e
Author: Raghavendra K T <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:41 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:01 +02:00

sched/numa: Move up the access pid reset logic

Recent NUMA hinting faulting activity is reset approximately every
VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not
accessed a VMA then the reset check is missed and the reset is potentially
deferred forever. Check if the PID activity information should be reset
before checking if the current task recently trapped a NUMA hinting fault.

[ [email protected]: Rewrite changelog ]

Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31cfdb0..ce36969 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3260,16 +3260,7 @@ static void task_numa_work(struct callback_head *work)
continue;
}

- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma)) {
- trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
- continue;
- }
-
- /*
- * RESET access PIDs regularly for old VMAs. Resetting after checking
- * vma for recent access to avoid clearing PID info before access..
- */
+ /* RESET access PIDs regularly for old VMAs. */
if (mm->numa_scan_seq &&
time_after(jiffies, vma->numab_state->pids_active_reset)) {
vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
@@ -3278,6 +3269,12 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->pids_active[1] = 0;
}

+ /* Do not scan the VMA if task has not accessed */
+ if (!vma_is_accessed(vma)) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
+ continue;
+ }
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

Subject: [tip: sched/core] sched/numa: Document vma_numab_state fields

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9ae5c00ea2e600a8b823f9b95606dd244f3096bf
Gitweb: https://git.kernel.org/tip/9ae5c00ea2e600a8b823f9b95606dd244f3096bf
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:38 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 11:10:00 +02:00

sched/numa: Document vma_numab_state fields

Document the intended usage of the fields.

[ mingo: Reformatted to take less vertical space & tidied it up. ]

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/mm_types.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43..d7f042e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -551,8 +551,29 @@ struct vma_lock {
};

struct vma_numab_state {
+ /*
+ * Initialised as time in 'jiffies' after which VMA
+ * should be scanned. Delays first scan of new VMA by at
+ * least sysctl_numa_balancing_scan_delay:
+ */
unsigned long next_scan;
+
+ /*
+ * Time in jiffies when access_pids[] is reset to
+ * detect phase change behaviour:
+ */
unsigned long next_pid_reset;
+
+ /*
+ * Approximate tracking of PIDs that trapped a NUMA hinting
+ * fault. May produce false positives due to hash collisions.
+ *
+ * [0] Previous PID tracking
+ * [1] Current PID tracking
+ *
+ * Window moves after next_pid_reset has expired approximately
+ * every VMA_PID_RESET_PERIOD jiffies:
+ */
unsigned long access_pids[2];
};

2023-10-10 09:58:29

by Mel Gorman

Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

On Tue, Oct 10, 2023 at 11:23:00AM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > On a 2-socket Cascade Lake test machine, the time to complete the
> > workload is as follows;
> >
> > 6.6.0-rc2 6.6.0-rc2
> > sched-numabtrace-v1 sched-numabselective-v1
> > Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
> > Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
> > Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
> > CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
> > Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
> >
> > The time to complete the workload is reduced by almost 30%
> >
> > 6.6.0-rc2 6.6.0-rc2
> > sched-numabtrace-v1 sched-numabselective-v1 /
> > Duration User 91201.80 63506.64
> > Duration System 2015.53 1819.78
> > Duration Elapsed 1234.77 868.37
> >
> > In this specific case, system CPU time was not increased but it's not
> > universally true.
> >
> > From vmstat, the NUMA scanning and fault activity is as follows;
> >
> > 6.6.0-rc2 6.6.0-rc2
> > sched-numabtrace-v1 sched-numabselective-v1
> > Ops NUMA base-page range updates 64272.00 26374386.00
> > Ops NUMA PTE updates 36624.00 55538.00
> > Ops NUMA PMD updates 54.00 51404.00
> > Ops NUMA hint faults 15504.00 75786.00
> > Ops NUMA hint local faults % 14860.00 56763.00
> > Ops NUMA hint local percent 95.85 74.90
> > Ops NUMA pages migrated 1629.00 6469222.00
> >
> > Both the number of PTE updates and hint faults is dramatically
> > increased. While this is superficially unfortunate, it represents
> > ranges that were simply skipped without the patch. As a result
> > of the scanning and hinting faults, many more pages were also
> > migrated but as the time to completion is reduced, the overhead
> > is offset by the gain.
>
> Nice! I've applied your series to tip:sched/core with a few non-functional
> edits to comment/changelog formatting/clarity.
>

Thanks.

> Btw., was any previous analysis done on the size of the pids_active[] hash
> and the hash collision rate?
>

Not that I'm aware of but I also think it would be difficult to design
something representative in terms of a benchmark. New pids are typically
sequential so most benchmarks are not going to show many collisions
unless the hash algorithm ignores lower bits. Maybe it does; I didn't
actually check the hash algorithm, and if it does, that is likely the patch
justification right there (threads created at similar times are almost
certain to collide). As it was Peter that suggested the hash, I assumed
he considered collisions due to lower bits, but that is also lazy on my part.
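
For what it's worth, hash_32() is easy to replicate in userspace to
check how sequential PIDs spread (the constant is GOLDEN_RATIO_32 from
include/linux/hash.h; sketch for experimentation only):

  #include <stdio.h>
  #include <stdint.h>

  /* Userspace copy of the kernel's multiplicative hash_32(). */
  static unsigned int hash_32(uint32_t val, unsigned int bits)
  {
          return (val * 0x61C88647u) >> (32 - bits);
  }

  int main(void)
  {
          uint32_t pid;

          /* Sequential PIDs, as a threaded workload would create. */
          for (pid = 1000; pid < 1008; pid++)
                  printf("pid %u -> bit %u\n", pid, hash_32(pid, 6));
          return 0;
  }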

If lower bits are used then it would pose the question -- does it
matter? The intent of the bitmap is for threads to prefer updating PTEs
within task-active VMAs, but ultimately all VMAs should be scanned anyway,
so some of the overhead will be wasted. While collisions may occur, it's
still better than scanning within VMAs that are definitely *not* of
interest. That would suggest a sensible direction: scan in passes, much as
load balancing uses fbq_type in find_busiest_queue() to filter what types
of tasks should be considered for moving. So, maybe the passes would look
like:

1. Task-active
2. Multiple tasks active
3. Any task active
4. Inactive

The objective would be that PTE updates are as relevant as possible
and hopefully by the time only inactive VMAs are considered, there is a
relatively small amount of wasted work.
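
To make the passes concrete, a rough sketch (hypothetical names; the enum
and vma_scan_eligible() below are illustrative only, built on the existing
pids_active[] bits, not actual kernel code):

    /* Hypothetical pass ordering, analogous to fbq_type filtering. */
    enum numab_scan_pass {
            NUMAB_PASS_TASK_ACTIVE,    /* 1. current task trapped a fault */
            NUMAB_PASS_MULTI_ACTIVE,   /* 2. multiple tasks were active */
            NUMAB_PASS_ANY_ACTIVE,     /* 3. any task was active */
            NUMAB_PASS_INACTIVE,       /* 4. nothing active, scan regardless */
    };

    static bool vma_scan_eligible(struct vm_area_struct *vma,
                                  enum numab_scan_pass pass)
    {
            unsigned long pids = vma->numab_state->pids_active[0] |
                                 vma->numab_state->pids_active[1];

            switch (pass) {
            case NUMAB_PASS_TASK_ACTIVE:
                    return test_bit(hash_32(current->pid,
                                            ilog2(BITS_PER_LONG)), &pids);
            case NUMAB_PASS_MULTI_ACTIVE:
                    return bitmap_weight(&pids, BITS_PER_LONG) > 1;
            case NUMAB_PASS_ANY_ACTIVE:
                    return pids != 0;
            case NUMAB_PASS_INACTIVE:
            default:
                    return true;
            }
    }

A scan sequence would only fall through to a later pass when the earlier
passes find nothing eligible, so the least relevant PTE updates are done
last, if at all.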

> 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
> threaded workloads, and the kmalloc of numab_state likely allocates a full
> cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs)
> to 32 bytes (2x2 longs) with very little real cost, and still have a long
> field left to spare?
>

You're right, we could, and it's relatively cheap. I would worry that, as
the storage overhead is per-VMA, workloads on large machines may also have
lots of VMAs that are not necessarily using threads. As I would struggle
to provide supporting data justifying the change, I would also be hesitant
to try merging it because, if I was reviewing the patch for someone else,
the first question I would ask is "is there any performance benefit that
you can show?". I would expect a first patch to provide some telemetry
and a follow-up patch the justification.

--
Mel Gorman
SUSE Labs

2023-10-10 11:39:58

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 0/6] sched/numa: Complete scanning of partial and inactive VMAs

On 10/10/2023 2:01 PM, Mel Gorman wrote:
> NUMA Balancing currently uses PID fault activity within a VMA to
> determine if it is worth updating PTEs to trap NUMA hinting faults.
> While this reduces overhead, it misses two important corner cases.
> The first is that if Task A partially scans a VMA that is active and
> Task B resumes the scan but is inactive, then the remainder of the VMA
> may be missed. Similarly, if a VMA is inactive for a period of time then
> it may never be scanned again.
>
> Patches 1-3 improve the documentation of the current per-VMA tracking
> and adds a trace point for scan activity. Patch 4 addresses a corner
> case where the PID activity information may not be reset after the
> expected timeout. Patches 5-6 complete the scanning of partial and
> inactive VMAs within the scan sequence.
>
> This could be improved further but it would deserve a separate series on
> top with supporting data justifying the change. Otherwise any gain/loss
> due to the additional changes could be masked by this series on its own.
>

Thank you Mel for the patches. I see Ingo already took these to sched/core.
Here are my testing details, FWIW.

SUT:
- 4th Generation EPYC System
- 2 x 128C/256T
- NPS1 mode

base: 6.6-rc4
patch_v1r5: Mel's Initial series with prev_scan_seq = -1 fix
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/
sched-numabselective-v1r5

(May not be relevant, but I did see the numbers were even better for
thread_alloc, so..)

patch_v1r13: current series

numa01_thread_alloc
=============
base patch_v1r5 patch_v1r13

real 8m46.557s 8m29.040s 8m38.098s
user 599m6.070s 268m38.140s 404m52.065s
sys 3655m38.681s 3794m10.079s 3751m36.779s

numa_hit 394964680 396000482 393981391
numa_local 197351688 198242761 197008099
numa_other 197612992 197757721 196973292
numa_pte_updates 1160 790360 812
numa_hint_faults 755 729196 553
numa_hint_faults_local 754 410220 263
numa_pages_migrated 1 318976 290

numa01
======

real 18m26.691s 17m31.770s 17m33.540s
user 4501m40.194s 2148m7.993s 3295m57.897s
sys 3483m11.684s 4764m57.876s 4215m35.599s

numa_hit 395473956 395813242 395000242
numa_local 197776626 198188480 197983594
numa_other 197697330 197624762 197016648
numa_pte_updates 1447 4625319 7142774
numa_hint_faults 1390 4947832 10313097
numa_hint_faults_local 1288 2758651 5354895
numa_pages_migrated 102 594803 960422


Thanks and Regards
- Raghu

2023-10-10 11:40:36

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

On 10/10/2023 2:53 PM, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
[...]
>> Both the number of PTE updates and hint faults is dramatically
>> increased. While this is superficially unfortunate, it represents
>> ranges that were simply skipped without the patch. As a result
>> of the scanning and hinting faults, many more pages were also
>> migrated but as the time to completion is reduced, the overhead
>> is offset by the gain.
>
> Nice! I've applied your series to tip:sched/core with a few non-functional
> edits to comment/changelog formatting/clarity.
>
> Btw., was any previous analysis done on the size of the pids_active[] hash
> and the hash collision rate?
>

Hello Ingo,

I did test to understand the behaviour of threaded workloads in relation
to pids_active[], and saw that the bits set are more spread out etc., but
I did not actually look at it from a hash-collision point of view, since
the hash would work well both for workloads that do not create multiple
threads quickly (thus we have sparse PIDs) and for the normal case where
we have almost contiguous PIDs.

But I also did not try to increase the size of an individual pids_active
entry beyond 64 bits, for the same reasoning as Mel's: at most we end up
doing some cross-task PTE updates.

Perhaps I can experiment when I come across workloads that have
512/1000s of threads, to refine these cross PTE updates further.

However, I have tried increasing the history (PeterZ's patch), i.e.,
increasing the array size of pids_active[], to get a better VMA candidate
for PTE updates in

https://lore.kernel.org/all/[email protected]/T/

to handle what Mel suggested in the other email, viz.:

1. Task-active
2. Multiple tasks active
3. Any task active
4. Inactive

Thanks and Regards
- Raghu

2023-10-10 11:42:40

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

On 10/10/2023 2:01 PM, Mel Gorman wrote:
> VMAs are skipped if there is no recent fault activity but this represents
> a chicken-and-egg problem as there may be no fault activity if the PTEs
> are never updated to trap NUMA hints. There is an indirect reliance on
> scanning to be forced early in the lifetime of a task but this may fail
> to detect changes in phase behaviour. Force inactive VMAs to be scanned
> when all other eligible VMAs have been updated within the same scan
> sequence.
>
> Test results in general look good with some changes in performance, both
> negative and positive, depending on whether the additional scanning and
> faulting was beneficial or not to the workload. The autonuma benchmark
> workload NUMA01_THREADLOCAL was picked for closer examination. The workload
> creates two processes with numerous threads and thread-local storage that
> is zero-filled in a loop. It exercises the corner case where unrelated
> threads may skip VMAs that are thread-local to another thread and still
> has some VMAs that are inactive while the workload executes.
>
> The VMA skipping activity frequency with and without the patch is as
> follows;
>
> 6.6.0-rc2-sched-numabtrace-v1
> 649 reason=scan_delay
> 9094 reason=unsuitable
> 48915 reason=shared_ro
> 143919 reason=inaccessible
> 193050 reason=pid_inactive
>
> 6.6.0-rc2-sched-numabselective-v1
> 146 reason=seq_completed
> 622 reason=ignore_pid_inactive
> 624 reason=scan_delay
> 6570 reason=unsuitable
> 16101 reason=shared_ro
> 27608 reason=inaccessible
> 41939 reason=pid_inactive
>
> Note that with the patch applied, the PID activity is ignored
> (ignore_pid_inactive) to ensure a VMA with some activity is completely
> scanned. In addition, a small number of VMAs are scanned when no other
> eligible VMA is available during a single scan window (seq_completed).
> The number of times a VMA is skipped due to no PID activity from the
> scanning task (pid_inactive) drops dramatically. It is expected that
> this will increase the number of PTEs updated for NUMA hinting faults
> as well as hinting faults but these represent PTEs that would otherwise
> have been missed. The tradeoff is scan+fault overhead versus improving
> locality due to migration.
>
> On a 2-socket Cascade Lake test machine, the time to complete the
> workload is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
> Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
> Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
> CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
> Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
>
> The time to complete the workload is reduced by almost 30%
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1 /
> Duration User 91201.80 63506.64
> Duration System 2015.53 1819.78
> Duration Elapsed 1234.77 868.37
>
> In this specific case, system CPU time was not increased but it's not
> universally true.
>
> From vmstat, the NUMA scanning and fault activity is as follows;
>
> 6.6.0-rc2 6.6.0-rc2
> sched-numabtrace-v1 sched-numabselective-v1
> Ops NUMA base-page range updates 64272.00 26374386.00
> Ops NUMA PTE updates 36624.00 55538.00
> Ops NUMA PMD updates 54.00 51404.00
> Ops NUMA hint faults 15504.00 75786.00
> Ops NUMA hint local faults % 14860.00 56763.00
> Ops NUMA hint local percent 95.85 74.90
> Ops NUMA pages migrated 1629.00 6469222.00
>
> Both the number of PTE updates and hint faults is dramatically
> increased. While this is superficially unfortunate, it represents
> ranges that were simply skipped without the patch. As a result
> of the scanning and hinting faults, many more pages were also
> migrated but as the time to completion is reduced, the overhead
> is offset by the gain.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mm_types.h | 6 +++
> include/linux/sched/numa_balancing.h | 1 +
> include/trace/events/sched.h | 3 +-
> kernel/sched/fair.c | 55 ++++++++++++++++++++++++++--
> 4 files changed, 61 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8cb1dec3e358..a123c1a58617 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -578,6 +578,12 @@ struct vma_numab_state {
> * VMA_PID_RESET_PERIOD
> * jiffies.
> */
> + int prev_scan_seq; /* MM scan sequence ID when
> + * the VMA was last completely
> + * scanned. A VMA is not
> + * eligible for scanning if
> + * prev_scan_seq == numa_scan_seq
> + */
> };
>
> /*
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 7dcc0bdfddbb..b69afb8630db 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
> @@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
> NUMAB_SKIP_SCAN_DELAY,
> NUMAB_SKIP_PID_INACTIVE,
> NUMAB_SKIP_IGNORE_PID,
> + NUMAB_SKIP_SEQ_COMPLETED,
> };
>
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 27b51c81b106..010ba1b7cb0e 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
> EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
> EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
> EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
> - EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )
> + EM( NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" ) \
> + EMe(NUMAB_SKIP_SEQ_COMPLETED, "seq_completed" )
>
> /* Redefine for export. */
> #undef EM
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 150f01948ec6..72ef60f394ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
> unsigned long nr_pte_updates = 0;
> long pages, virtpages;
> struct vma_iterator vmi;
> + bool vma_pids_skipped;
> + bool vma_pids_forced = false;
>
> SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
>
> @@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
> */
> p->node_stamp += 2 * TICK_NSEC;
>
> - start = mm->numa_scan_offset;
> pages = sysctl_numa_balancing_scan_size;
> pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> virtpages = pages * 8; /* Scan up to this much virtual space */
> @@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)
>
> if (!mmap_read_trylock(mm))
> return;
> +
> + /*
> + * VMAs are skipped if the current PID has not trapped a fault within
> + * the VMA recently. Allow scanning to be forced if there is no
> + * suitable VMA remaining.
> + */
> + vma_pids_skipped = false;
> +
> +retry_pids:
> + start = mm->numa_scan_offset;
> vma_iter_init(&vmi, mm, start);
> vma = vma_next(&vmi);
> if (!vma) {
> @@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
> /* Reset happens after 4 times scan delay of scan start */
> vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
> msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> + /*
> + * Ensure prev_scan_seq does not match numa_scan_seq
> + * to prevent VMAs being skipped prematurely on the
> + * first scan.
> + */
> + vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;

nit:
Perhaps even vma->numab_state->prev_scan_seq = -1 would have worked, but it
does not matter.

> }

2023-10-10 21:40:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative


* Mel Gorman <[email protected]> wrote:

> > 64 (BITS_PER_LONG) feels a bit small, especially on larger machines running
> > threaded workloads, and the kmalloc of numab_state likely allocates a full
> > cacheline anyway, so we could double the hash size from 8 bytes (2x1 longs)
^--- 16 bytes
> > to 32 bytes (2x2 longs) with very little real cost, and still have a long
> > field left to spare?
> >
>
> You're right, we could and it's relatively cheap. I would worry that as
> the storage overhead is per-VMA then workloads for large machines may
> also have lots of VMAs that are not necessarily using threads.

So I think there would be *zero* extra per-vma storage overhead, because
vma->numab_state is a pointer, with the structure kmalloc() allocated,
which should round the allocation to cacheline granularity anyway: 64 bytes
on NUMA systems that matter.

So with the current size of 'struct vma_numab_state' of 36 bytes, we can
extend it by 16 bytes with zero additional storage cost.

And since there's no cost, and fewer hash collisions are always good, the
change wouldn't need any other justification. :-)

[ Plus the resulting abstraction for the definition of a larger bitmask
would probably make future extensions easier. ]
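
For illustration, a minimal sketch of the shape such a change might take
(hypothetical: the NUMAB_PID_WORDS define and the doubled pids_active[]
layout are mine, not a tested patch):

    /* Hypothetical: 128 PID-hash bits per window instead of 64. */
    #define NUMAB_PID_WORDS    2
    #define NUMAB_PID_BITS     (NUMAB_PID_WORDS * BITS_PER_LONG)

    struct vma_numab_state {
            unsigned long next_scan;
            unsigned long pids_active_reset;
            /* Two windows of NUMAB_PID_BITS bits each. */
            unsigned long pids_active[2][NUMAB_PID_WORDS];
            int prev_scan_seq;
    };

The bit index in vma_is_accessed() would then widen from
ilog2(BITS_PER_LONG) to ilog2(NUMAB_PID_BITS), i.e.
hash_32(current->pid, ilog2(NUMAB_PID_BITS)), and at 2x2 longs the
structure would still fit within the single cacheline that kmalloc()
returns.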

But ... it was just a suggestion.

Thanks,

Ingo

2023-10-10 21:45:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/6] sched/numa: Complete scanning of partial and inactive VMAs


* Raghavendra K T <[email protected]> wrote:

> On 10/10/2023 2:01 PM, Mel Gorman wrote:
> > NUMA Balancing currently uses PID fault activity within a VMA to
> > determine if it is worth updating PTEs to trap NUMA hinting faults.
> > While this reduces overhead, it misses two important corner cases.
> > The first is that if Task A partially scans a VMA that is active and
> > Task B resumes the scan but is inactive, then the remainder of the VMA
> > may be missed. Similarly, if a VMA is inactive for a period of time then
> > it may never be scanned again.
> >
> > Patches 1-3 improve the documentation of the current per-VMA tracking
> > and adds a trace point for scan activity. Patch 4 addresses a corner
> > case where the PID activity information may not be reset after the
> > expected timeout. Patches 5-6 complete the scanning of partial and
> > inactive VMAs within the scan sequence.
> >
> > This could be improved further but it would deserve a separate series on
> > top with supporting data justifying the change. Otherwise any gain/loss
> > due to the additional changes could be masked by this series on its own.
> >
>
> Thank you Mel for the patches. I see Ingo already took to sched/core.
> Here is my testing detail FWIW.

Thank you for testing the series, I've added your Tested-by to the final
two patches that change behavior materially:

Tested-by: Raghavendra K T <[email protected]>

Thanks,

Ingo

Subject: [tip: sched/core] sched/numa: Complete scanning of partial VMAs regardless of PID activity

The following commit has been merged into the sched/core branch of tip:

Commit-ID: b7a5b537c55c088d891ae554103d1b281abef781
Gitweb: https://git.kernel.org/tip/b7a5b537c55c088d891ae554103d1b281abef781
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:42 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 23:41:47 +02:00

sched/numa: Complete scanning of partial VMAs regardless of PID activity

NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.

Force tasks to update PTEs for partially scanned PTEs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.

The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/numa_balancing.h | 1 +
include/trace/events/sched.h | 3 ++-
kernel/sched/fair.c | 18 +++++++++++++++---
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index c127a15..7dcc0bd 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -21,6 +21,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_INACCESSIBLE,
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
+ NUMAB_SKIP_IGNORE_PID,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index d82a04d..bfc07c1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -670,7 +670,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_SHARED_RO, "shared_ro" ) \
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
- EMe(NUMAB_SKIP_PID_INACTIVE, "pid_inactive" )
+ EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
+ EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce36969..ab79013 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3113,7 +3113,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}

-static bool vma_is_accessed(struct vm_area_struct *vma)
+static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
{
unsigned long pids;
/*
@@ -3126,7 +3126,19 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
return true;

pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
- return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+ if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+ return true;
+
+ /*
+ * Complete a scan that has already started regardless of PID access, or
+ * some VMAs may never be scanned in multi-threaded applications:
+ */
+ if (mm->numa_scan_offset > vma->vm_start) {
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_IGNORE_PID);
+ return true;
+ }
+
+ return false;
}

#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
@@ -3270,7 +3282,7 @@ static void task_numa_work(struct callback_head *work)
}

/* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(vma)) {
+ if (!vma_is_accessed(mm, vma)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}

Subject: [tip: sched/core] sched/numa: Complete scanning of inactive VMAs when there is no alternative

The following commit has been merged into the sched/core branch of tip:

Commit-ID: f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa
Gitweb: https://git.kernel.org/tip/f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa
Author: Mel Gorman <[email protected]>
AuthorDate: Tue, 10 Oct 2023 09:31:43 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 10 Oct 2023 23:42:15 +02:00

sched/numa: Complete scanning of inactive VMAs when there is no alternative

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

6.6.0-rc2-sched-numabtrace-v1
=============================
649 reason=scan_delay
9,094 reason=unsuitable
48,915 reason=shared_ro
143,919 reason=inaccessible
193,050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
=============================
146 reason=seq_completed
622 reason=ignore_pid_inactive

624 reason=scan_delay
6,570 reason=unsuitable
16,101 reason=shared_ro
27,608 reason=inaccessible
41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)

The time to complete the workload is reduced by almost 30%:

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1 /
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/mm_types.h | 6 +++-
include/linux/sched/numa_balancing.h | 1 +-
include/trace/events/sched.h | 3 +-
kernel/sched/fair.c | 55 +++++++++++++++++++++++++--
4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e7571ec..589f31e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -575,6 +575,12 @@ struct vma_numab_state {
* every VMA_PID_RESET_PERIOD jiffies:
*/
unsigned long pids_active[2];
+
+ /*
+ * MM scan sequence ID when the VMA was last completely scanned.
+ * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+ */
+ int prev_scan_seq;
};

/*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bd..b69afb8 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
NUMAB_SKIP_SCAN_DELAY,
NUMAB_SKIP_PID_INACTIVE,
NUMAB_SKIP_IGNORE_PID,
+ NUMAB_SKIP_SEQ_COMPLETED,
};

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bfc07c1..6188ad0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
EM( NUMAB_SKIP_INACCESSIBLE, "inaccessible" ) \
EM( NUMAB_SKIP_SCAN_DELAY, "scan_delay" ) \
EM( NUMAB_SKIP_PID_INACTIVE, "pid_inactive" ) \
- EMe(NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" )
+ EM( NUMAB_SKIP_IGNORE_PID, "ignore_pid_inactive" ) \
+ EMe(NUMAB_SKIP_SEQ_COMPLETED, "seq_completed" )

/* Redefine for export. */
#undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab79013..9229051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3158,6 +3158,8 @@ static void task_numa_work(struct callback_head *work)
unsigned long nr_pte_updates = 0;
long pages, virtpages;
struct vma_iterator vmi;
+ bool vma_pids_skipped;
+ bool vma_pids_forced = false;

SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));

@@ -3200,7 +3202,6 @@ static void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;

- start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
virtpages = pages * 8; /* Scan up to this much virtual space */
@@ -3210,6 +3211,16 @@ static void task_numa_work(struct callback_head *work)

if (!mmap_read_trylock(mm))
return;
+
+ /*
+ * VMAs are skipped if the current PID has not trapped a fault within
+ * the VMA recently. Allow scanning to be forced if there is no
+ * suitable VMA remaining.
+ */
+ vma_pids_skipped = false;
+
+retry_pids:
+ start = mm->numa_scan_offset;
vma_iter_init(&vmi, mm, start);
vma = vma_next(&vmi);
if (!vma) {
@@ -3260,6 +3271,13 @@ static void task_numa_work(struct callback_head *work)
/* Reset happens after 4 times scan delay of scan start */
vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ /*
+ * Ensure prev_scan_seq does not match numa_scan_seq,
+ * to prevent VMAs being skipped prematurely on the
+ * first scan:
+ */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
}

/*
@@ -3281,8 +3299,19 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->pids_active[1] = 0;
}

- /* Do not scan the VMA if task has not accessed */
- if (!vma_is_accessed(mm, vma)) {
+ /* Do not rescan VMAs twice within the same sequence. */
+ if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+ mm->numa_scan_offset = vma->vm_end;
+ trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+ continue;
+ }
+
+ /*
+ * Do not scan the VMA if task has not accessed it, unless no other
+ * VMA candidate exists.
+ */
+ if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+ vma_pids_skipped = true;
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
continue;
}
@@ -3311,8 +3340,28 @@ static void task_numa_work(struct callback_head *work)

cond_resched();
} while (end != vma->vm_end);
+
+ /* VMA scan is complete, do not scan until next sequence. */
+ vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+ /*
+ * Only force scan within one VMA at a time, to limit the
+ * cost of scanning a potentially uninteresting VMA.
+ */
+ if (vma_pids_forced)
+ break;
} for_each_vma(vmi, vma);

+ /*
+ * If no VMAs are remaining and VMAs were skipped due to the PID
+ * not accessing the VMA previously, then force a scan to ensure
+ * forward progress:
+ */
+ if (!vma && !vma_pids_forced && vma_pids_skipped) {
+ vma_pids_forced = true;
+ goto retry_pids;
+ }
+
out:
/*
* It is possible to reach the end of the VMA list but the last few

2024-02-24 04:50:55

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

On 10/10/2023 5:10 PM, Raghavendra K T wrote:
> On 10/10/2023 2:53 PM, Ingo Molnar wrote:
>>
>> * Mel Gorman <[email protected]> wrote:
>>
> [...]
>>> Both the number of PTE updates and hint faults is dramatically
>>> increased. While this is superficially unfortunate, it represents
>>> ranges that were simply skipped without the patch. As a result
>>> of the scanning and hinting faults, many more pages were also
>>> migrated but as the time to completion is reduced, the overhead
>>> is offset by the gain.
>>
>> Nice! I've applied your series to tip:sched/core with a few
>> non-functional
>> edits to comment/changelog formatting/clarity.
>>
>> Btw., was any previous analysis done on the size of the pids_active[]
>> hash
>> and the hash collision rate?
>>

Hello Ingo,

I did complete the hash collision experiment you asked about long back, but
did not report it sooner (year-end time). Sorry for that.

Here is the summary:

Context:
========
Currently in VMA scanning we hash the PID value into a 6-bit value
so that we can use an 8-byte long variable to record which PIDs have
accessed a VMA, to optimize scanning (the HASH32 method suggested by
PeterZ).

function used: hash_32(PID, ilog2(BITS_PER_LONG))


The alternative was to directly use the last 6 bits of the PID (the
PID_6BIT method).

The current experiment evaluates what the distribution and collisions look like.

Experiment:
Main thread
- Allocates large memory.
- Creates n threads that sweep the allocated memory for a fixed number of
  iterations (n = 8, 16, 32, 64).

(All these threads are created without delay.)

Note down the hash_32 values for the threads, taken from the trace.

Note down the PIDs of the threads and extract the last 6 bits.

Generate a hashvalue-frequency list.

Notes:
1) The 8-thread experiment will have 8 thread + 1 main thread hash values,
and so on.

2) When we have a large number of threads, sometimes not all of the threads
get a chance to scan a VMA, and hence the total count may be lower.
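
For reference, the two bucket values can be reproduced in userspace with a
small sketch like the one below (illustrative only, not the actual test
harness; it assumes the kernel's multiplicative hash_32() based on
GOLDEN_RATIO_32 and a burst of sequential PIDs):

    #include <stdio.h>
    #include <stdint.h>

    #define GOLDEN_RATIO_32 0x61C88647u

    /* Kernel-style hash_32(): multiplicative hash, keep the top 'bits' bits. */
    static uint32_t hash_32(uint32_t val, unsigned int bits)
    {
            return (val * GOLDEN_RATIO_32) >> (32 - bits);
    }

    int main(void)
    {
            unsigned int hash32_freq[64] = { 0 }, pid6_freq[64] = { 0 };
            uint32_t pid;
            int i;

            /* 64 sequential PIDs, as from a burst of pthread_create() calls. */
            for (pid = 10000; pid < 10064; pid++) {
                    hash32_freq[hash_32(pid, 6)]++; /* HASH32 method */
                    pid6_freq[pid & 0x3f]++;        /* PID_6BIT method */
            }

            /* Any bucket with a frequency above 1 is a collision. */
            for (i = 0; i < 64; i++)
                    if (hash32_freq[i] > 1 || pid6_freq[i] > 1)
                            printf("bucket %2d: hash32=%u pid6=%u\n",
                                   i, hash32_freq[i], pid6_freq[i]);
            return 0;
    }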

Observations:
===============
1) PID_6BIT generates hash values which are crowded together.

2) HASH32 generates very well distributed hash values (as expected).

3) There are no collisions when the total number of threads created is
less than 64, in both cases.

4) When the number of threads created is 64, we see more collisions with
the HASH32 method, but false positives did not seem to be an issue in the
experiments so far. Also, the number of collisions is not that high.

I think we got lucky in the PID_6BIT case.

Overall, hash_32 serves the intended purpose.

(Ingo, I have experimented with extending the total PID info stored from
64 to 128 on larger systems; I will post that separately with the patch.)

==================
iter0/8-thread
==================

PID_6BIT HASH32
(value-freq) (value-freq)

0 - 1 5 - 1
1 - 1 14 - 1
2 - 1 20 - 1
3 - 1 29 - 1
4 - 1 35 - 1
5 - 1 44 - 1
56 - 1 52 - 1
62 - 1 54 - 1
63 - 1 59 - 1

==================
iter0/16-thread
==================
0 - 1 3 - 1
1 - 1 9 - 1
2 - 1 12 - 1
3 - 1 14 - 1
4 - 1 18 - 1
5 - 1 24 - 1
6 - 1 27 - 1
7 - 1 33 - 1
8 - 1 37 - 1
9 - 1 39 - 1
10 - 1 42 - 1
11 - 1 48 - 1
59 - 1 52 - 1
60 - 1 54 - 1
61 - 1 58 - 1
62 - 1 61 - 1
63 - 1 63 - 1

==================
iter0/32-thread
==================
0 - 1 0 - 1
1 - 1 2 - 1
2 - 1 4 - 1
3 - 1 5 - 1
4 - 1 8 - 1
5 - 1 11 - 1
6 - 1 13 - 1
7 - 1 15 - 1
8 - 1 17 - 1
9 - 1 19 - 1
10 - 1 20 - 1
11 - 1 23 - 1
12 - 1 24 - 1
13 - 1 26 - 1
14 - 1 28 - 1
15 - 1 30 - 1
16 - 1 32 - 1
48 - 1 34 - 1
49 - 1 36 - 1
50 - 1 38 - 1
51 - 1 39 - 1
52 - 1 41 - 1
53 - 1 44 - 1
54 - 1 45 - 1
55 - 1 47 - 1
56 - 1 48 - 1
57 - 1 51 - 1
58 - 1 53 - 1
59 - 1 54 - 1
60 - 1 56 - 1
61 - 1 59 - 1
62 - 1 60 - 1
63 - 1 62 - 1

==================
iter0/64-thread
==================
0 - 1 0 - 1
1 - 1 1 - 1
2 - 1 2 - 1
3 - 1 4 - 1
4 - 1 6 - 1
5 - 1 7 - 1
6 - 1 8 - 1
7 - 1 9 - 1
8 - 1 10 - 1
9 - 1 12 - 1
10 - 1 13 - 1
11 - 1 15 - 1
12 - 1 16 - 1
15 - 1 17 - 2
16 - 1 19 - 1
18 - 1 20 - 1
20 - 1 21 - 1
22 - 1 22 - 1
23 - 1 23 - 1
24 - 1 25 - 2
25 - 2 27 - 1
26 - 1 29 - 1
27 - 1 30 - 1
28 - 1 31 - 1
29 - 1 32 - 1
30 - 1 33 - 1
31 - 1 34 - 1
32 - 1 35 - 1
33 - 1 36 - 1
34 - 1 37 - 1
35 - 1 40 - 2
36 - 1 41 - 1
37 - 1 42 - 1
38 - 1 43 - 1
39 - 1 44 - 1
40 - 1 45 - 1
41 - 1 46 - 1
42 - 1 47 - 1
43 - 1 48 - 1
44 - 1 49 - 1
45 - 1 50 - 1
48 - 1 53 - 2
49 - 1 55 - 1
50 - 1 56 - 2
51 - 1 57 - 1
52 - 1 58 - 1
53 - 1 59 - 1
54 - 1 61 - 2
55 - 1 62 - 1
56 - 1 63 - 1
57 - 1
60 - 1
61 - 1
62 - 1
63 - 1

==================
iter1/8-thread
==================
53 - 1 8 - 1
55 - 1 17 - 1
56 - 1 23 - 1
57 - 1 33 - 1
58 - 1 38 - 1
59 - 1 48 - 1
60 - 1 53 - 1
61 - 1 57 - 1
62 - 1 63 - 1

==================
iter1/16-thread
==================
4 - 1 0 - 1
5 - 1 6 - 1
7 - 1 9 - 1
8 - 1 15 - 1
9 - 1 25 - 1
10 - 1 30 - 1
11 - 1 36 - 1
12 - 1 40 - 1
13 - 1 43 - 1
14 - 1 45 - 1
15 - 1 49 - 1
16 - 1 55 - 1
18 - 1 58 - 1
20 - 1 61 - 1

==================
iter1/32-thread
==================
27 - 1 1 - 1
28 - 1 3 - 1
29 - 1 5 - 1
30 - 1 7 - 1
31 - 1 9 - 1
32 - 1 11 - 1
33 - 1 12 - 1
34 - 1 14 - 1
35 - 1 17 - 1
36 - 1 18 - 1
37 - 1 20 - 1
38 - 1 22 - 1
39 - 1 24 - 1
40 - 1 26 - 1
41 - 1 27 - 1
42 - 1 29 - 1
43 - 1 32 - 1
44 - 1 33 - 1
45 - 1 35 - 1
46 - 1 37 - 1
47 - 1 39 - 1
48 - 1 41 - 1
49 - 1 42 - 1
50 - 1 45 - 1
51 - 1 47 - 1
52 - 1 48 - 1
53 - 1 50 - 1
54 - 1 52 - 1
55 - 1 54 - 1
56 - 1 56 - 1
57 - 1 58 - 1
58 - 1 60 - 1
59 - 1 63 - 1

==================
iter1/64-thread
==================
0 - 1 0 - 1
1 - 1 1 - 1
2 - 1 2 - 1
3 - 1 3 - 1
4 - 1 4 - 2
5 - 1 6 - 1
6 - 1 7 - 2
7 - 1 8 - 1
10 - 1 9 - 1
12 - 1 10 - 1
14 - 1 12 - 1
15 - 1 13 - 1
16 - 2 14 - 1
18 - 1 15 - 1
19 - 1 16 - 1
20 - 1 17 - 1
21 - 1 19 - 1
22 - 1 20 - 1
23 - 1 22 - 1
24 - 1 23 - 1
25 - 1 24 - 1
26 - 1 25 - 1
27 - 1 26 - 1
28 - 1 27 - 1
31 - 1 30 - 1
33 - 1 31 - 1
34 - 1 32 - 2
35 - 1 34 - 1
36 - 1 35 - 1
37 - 1 36 - 1
38 - 1 37 - 1
39 - 1 40 - 1
40 - 1 41 - 1
41 - 1 42 - 1
42 - 1 44 - 1
43 - 1 45 - 1
44 - 1 46 - 1
45 - 1 47 - 1
46 - 1 48 - 1
48 - 1 49 - 1
49 - 1 50 - 1
50 - 1 51 - 1
51 - 1 52 - 1
52 - 1 55 - 1
54 - 1 57 - 1
55 - 1 58 - 1
56 - 1 59 - 1
57 - 1 60 - 1
58 - 1 62 - 1
59 - 1 63 - 1
60 - 1
62 - 1

Thanks and Regards
- Raghu