2021-09-21 16:15:36

by Nicolas Saenz Julienne

Subject: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

This series introduces an alternative locking scheme around mm/swap.c's per-cpu
LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which allows
remote CPUs to drain them. Currently, only the local CPU is permitted to
change its per-cpu lists, and it is expected to do so on demand, whenever a
process requests it (by queueing a drain task on the local CPU). Most
systems handle this promptly, but it causes problems for NOHZ_FULL CPUs,
which can't take any sort of interruption without breaking their functional
guarantees (latency, bandwidth, etc.). Having a way for these processes to
remotely drain the lists themselves makes co-existing with isolated CPUs
possible, at the cost of more constraining locks.

Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
drain code are conditional on a static key which is disabled by default. This
keeps functional and performance regressions to a minimum. The feature is only
enabled once NOHZ_FULL has been successfully initialized.
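
At its core, the series boils down to the following locking pattern, gated by
the static key (condensed from patch 2, shown here only for orientation):

static inline void lru_cache_lock(struct lru_cache_locks *locks)
{
	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
		/* remote-drain mode: a per-CPU spinlock protects the caches */
		migrate_disable();
		spin_lock(this_cpu_ptr(&locks->spin));
	} else {
		/* default mode: cheap, strictly CPU-local protection */
		local_lock(&locks->local);
	}
}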

This work is based on a previous series by Thomas Gleixner, Anna-Maria
Gleixner, and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

Nicolas Saenz Julienne (6):
mm/swap: Introduce lru_cpu_needs_drain()
mm/swap: Introduce alternative per-cpu LRU cache locking
mm/swap: Allow remote LRU cache draining
mm/page_alloc: Introduce alternative per-cpu list locking
mm/page_alloc: Allow remote per-cpu page list draining
sched/isolation: Enable 'remote_pcpu_cache_access' on NOHZ_FULL
systems

kernel/sched/isolation.c | 9 +-
mm/internal.h | 2 +
mm/page_alloc.c | 111 ++++++++++++++++-----
mm/swap.c | 202 ++++++++++++++++++++++++++++++---------
4 files changed, 253 insertions(+), 71 deletions(-)

--
2.31.1


2021-09-21 16:15:43

by Nicolas Saenz Julienne

Subject: [PATCH 1/6] mm/swap: Introduce lru_cpu_needs_drain()

Upcoming patches will need to check in multiple places whether a CPU needs to
drain its LRU pagevecs. So move the check into its own function.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/swap.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 04b678342c02..e7f9e4018ccf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -726,6 +726,17 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
lru_add_drain();
}

+static bool lru_cpu_needs_drain(int cpu)
+{
+ return pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
+ pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
+ pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
+ pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
+ need_activate_page_drain(cpu) ||
+ has_bh_in_lru(cpu, NULL);
+}
+
/*
* Doesn't need any cpu hotplug locking because we do rely on per-cpu
* kworkers being shut down before our page_alloc_cpu_dead callback is
@@ -808,14 +819,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
- data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
- pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
- pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
- pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
- need_activate_page_drain(cpu) ||
- has_bh_in_lru(cpu, NULL)) {
+ if (force_all_cpus || lru_cpu_needs_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
__cpumask_set_cpu(cpu, &has_work);
--
2.31.1

2021-09-21 16:15:51

by Nicolas Saenz Julienne

Subject: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

'struct lru_pvecs' and 'struct lru_rotate' per-CPU pagevecs are
protected using local locks. While good for performance, this doesn't
allow for remote access to these structures. CPUs requiring system-wide
LRU cache drains get around this by scheduling drain work on all CPUs.
That said, some setups, like systems with NOHZ_FULL CPUs, aren't well
suited to this, as they can't handle interruptions of any sort.

To mitigate this, introduce an alternative locking scheme using
spinlocks that will permit remotely accessing these per-CPU pagevecs.
It's disabled by default, with no functional change to regular users,
and enabled through the 'remote_pcpu_cache_access' static key. Upcoming
patches will make use of this static key.

The naming of the static key is left vague on purpose as it is planned
to also enable a similar locking setup to access mm/page_alloc.c's
per-cpu page lists remotely.

This is based on previous work by Thomas Gleixner, Anna-Maria Gleixner,
and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/internal.h | 2 +
mm/swap.c | 152 +++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 128 insertions(+), 26 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 18256e32a14c..5a2cef7cd394 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -32,6 +32,8 @@
/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)

+extern struct static_key_false remote_pcpu_cache_access;
+
void page_writeback_init(void);

static inline void *folio_raw_mapping(struct folio *folio)
diff --git a/mm/swap.c b/mm/swap.c
index e7f9e4018ccf..bcf73bd563a6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,13 +46,27 @@
/* How many pages do we try to swap or page in/out together? */
int page_cluster;

+/*
+ * On some setups, like with nohz_full, CPUs might be too busy to handle
+ * per-cpu drain work, leading to unwarranted interruptions and hangs. This
+ * key, when enabled, allows for remote draining of these per-cpu caches/page
+ * lists at the cost of more constraining locking.
+ */
+__ro_after_init DEFINE_STATIC_KEY_FALSE(remote_pcpu_cache_access);
+
+struct lru_cache_locks {
+ local_lock_t local;
+ spinlock_t spin;
+};
+
/* Protecting only lru_rotate.pvec which requires disabling interrupts */
struct lru_rotate {
- local_lock_t lock;
+ struct lru_cache_locks locks;
struct pagevec pvec;
};
static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
- .lock = INIT_LOCAL_LOCK(lock),
+ .locks.local = INIT_LOCAL_LOCK(lru_rotate.locks.local),
+ .locks.spin = __SPIN_LOCK_UNLOCKED(lru_rotate.locks.spin),
};

/*
@@ -60,7 +74,7 @@ static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
* by disabling preemption (and interrupts remain enabled).
*/
struct lru_pvecs {
- local_lock_t lock;
+ struct lru_cache_locks locks;
struct pagevec lru_add;
struct pagevec lru_deactivate_file;
struct pagevec lru_deactivate;
@@ -70,9 +84,94 @@ struct lru_pvecs {
#endif
};
static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
- .lock = INIT_LOCAL_LOCK(lock),
+ .locks.local = INIT_LOCAL_LOCK(lru_pvecs.locks.local),
+ .locks.spin = __SPIN_LOCK_UNLOCKED(lru_pvecs.locks.spin),
};

+static inline void lru_cache_lock(struct lru_cache_locks *locks)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ /* Avoid migration between this_cpu_ptr() and spin_lock() */
+ migrate_disable();
+ spin_lock(this_cpu_ptr(&locks->spin));
+ } else {
+ local_lock(&locks->local);
+ }
+}
+
+static inline void lru_cache_lock_irqsave(struct lru_cache_locks *locks,
+ unsigned long *flagsp)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ /* Avoid migration between this_cpu_ptr() and spin_lock_irqsave() */
+ migrate_disable();
+ spin_lock_irqsave(this_cpu_ptr(&locks->spin), *flagsp);
+ } else {
+ local_lock_irqsave(&locks->local, *flagsp);
+ }
+}
+
+/*
+ * The lru_cache_lock_cpu()/lru_cache_lock_irqsave_cpu() flavor of functions
+ * should only be used from remote CPUs when 'remote_pcpu_cache_access' is
+ * enabled or the target CPU is dead. Otherwise, it can still be called on the
+ * local CPU with migration disabled.
+ */
+static inline void lru_cache_lock_cpu(struct lru_cache_locks *locks, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_lock(per_cpu_ptr(&locks->spin, cpu));
+ else
+ local_lock(&locks->local);
+}
+
+static inline void lru_cache_lock_irqsave_cpu(struct lru_cache_locks *locks,
+ unsigned long *flagsp, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_lock_irqsave(per_cpu_ptr(&locks->spin, cpu), *flagsp);
+ else
+ local_lock_irqsave(&locks->local, *flagsp);
+}
+
+static inline void lru_cache_unlock(struct lru_cache_locks *locks)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ spin_unlock(this_cpu_ptr(&locks->spin));
+ migrate_enable();
+ } else {
+ local_unlock(&locks->local);
+ }
+}
+
+static inline void lru_cache_unlock_irqrestore(struct lru_cache_locks *locks,
+ unsigned long flags)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ spin_unlock_irqrestore(this_cpu_ptr(&locks->spin), flags);
+ migrate_enable();
+ } else {
+ local_unlock_irqrestore(&locks->local, flags);
+ }
+}
+
+static inline void lru_cache_unlock_cpu(struct lru_cache_locks *locks, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_unlock(per_cpu_ptr(&locks->spin, cpu));
+ else
+ local_unlock(&locks->local);
+}
+
+static inline void lru_cache_unlock_irqrestore_cpu(struct lru_cache_locks *locks,
+ unsigned long flags, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_unlock_irqrestore(per_cpu_ptr(&locks->spin, cpu), flags);
+ else
+ local_unlock_irqrestore(&locks->local, flags);
+}
+
/*
* This path almost never happens for VM activity - pages are normally
* freed via pagevecs. But it gets used by networking.
@@ -245,11 +344,11 @@ void folio_rotate_reclaimable(struct folio *folio)
unsigned long flags;

folio_get(folio);
- local_lock_irqsave(&lru_rotate.lock, flags);
+ lru_cache_lock_irqsave(&lru_rotate.locks, &flags);
pvec = this_cpu_ptr(&lru_rotate.pvec);
if (pagevec_add_and_need_flush(pvec, &folio->page))
pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
- local_unlock_irqrestore(&lru_rotate.lock, flags);
+ lru_cache_unlock_irqrestore(&lru_rotate.locks, flags);
}
}

@@ -341,11 +440,11 @@ static void folio_activate(struct folio *folio)
struct pagevec *pvec;

folio_get(folio);
- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.activate_page);
if (pagevec_add_and_need_flush(pvec, &folio->page))
pagevec_lru_move_fn(pvec, __activate_page);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}
}

@@ -372,7 +471,7 @@ static void __lru_cache_activate_folio(struct folio *folio)
struct pagevec *pvec;
int i;

- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);

/*
@@ -394,7 +493,7 @@ static void __lru_cache_activate_folio(struct folio *folio)
}
}

- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}

/*
@@ -453,11 +552,11 @@ void folio_add_lru(struct folio *folio)
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

folio_get(folio);
- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
if (pagevec_add_and_need_flush(pvec, &folio->page))
__pagevec_lru_add(pvec);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}
EXPORT_SYMBOL(folio_add_lru);

@@ -592,8 +691,9 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)

/*
* Drain pages out of the cpu's pagevecs.
- * Either "cpu" is the current CPU, and preemption has already been
- * disabled; or "cpu" is being hot-unplugged, and is already dead.
+ * Either "cpu" is the current CPU, and preemption has already been disabled,
+ * or we're remotely flushing pvecs with the 'remote_pcpu_cache_access' key
+ * enabled, or "cpu" is being hot-unplugged and is already dead.
*/
void lru_add_drain_cpu(int cpu)
{
@@ -608,9 +708,9 @@ void lru_add_drain_cpu(int cpu)
unsigned long flags;

/* No harm done if a racing interrupt already did this */
- local_lock_irqsave(&lru_rotate.lock, flags);
+ lru_cache_lock_irqsave_cpu(&lru_rotate.locks, &flags, cpu);
pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
- local_unlock_irqrestore(&lru_rotate.lock, flags);
+ lru_cache_unlock_irqrestore_cpu(&lru_rotate.locks, flags, cpu);
}

pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
@@ -649,12 +749,12 @@ void deactivate_file_page(struct page *page)
if (likely(get_page_unless_zero(page))) {
struct pagevec *pvec;

- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);

if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}
}

@@ -671,12 +771,12 @@ void deactivate_page(struct page *page)
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
struct pagevec *pvec;

- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
get_page(page);
if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_deactivate_fn);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}
}

@@ -693,28 +793,28 @@ void mark_page_lazyfree(struct page *page)
!PageSwapCache(page) && !PageUnevictable(page)) {
struct pagevec *pvec;

- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
get_page(page);
if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}
}

void lru_add_drain(void)
{
- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
lru_add_drain_cpu(smp_processor_id());
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}

void lru_add_drain_cpu_zone(struct zone *zone)
{
- local_lock(&lru_pvecs.lock);
+ lru_cache_lock(&lru_pvecs.locks);
lru_add_drain_cpu(smp_processor_id());
drain_local_pages(zone);
- local_unlock(&lru_pvecs.lock);
+ lru_cache_unlock(&lru_pvecs.locks);
}

#ifdef CONFIG_SMP
--
2.31.1

2021-09-21 16:16:39

by Nicolas Saenz Julienne

Subject: [PATCH 3/6] mm/swap: Allow remote LRU cache draining

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __lru_add_drain_all(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by using a more
constraining locking scheme, which is disabled by default and can be
enabled through the 'remote_pcpu_cache_access' static key. Regular users
shouldn't see any functional or performance change. Upcoming patches
will make use of the key.

Based on previous work by Thomas Gleixner, Anna-Maria Gleixner, and
Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/swap.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bcf73bd563a6..59e96a2520d5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -915,19 +915,29 @@ inline void __lru_add_drain_all(bool force_all_cpus)
WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
smp_mb();

- cpumask_clear(&has_work);
- for_each_online_cpu(cpu) {
- struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
-
- if (force_all_cpus || lru_cpu_needs_drain(cpu)) {
- INIT_WORK(work, lru_add_drain_per_cpu);
- queue_work_on(cpu, mm_percpu_wq, work);
- __cpumask_set_cpu(cpu, &has_work);
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ for_each_online_cpu(cpu) {
+ if (force_all_cpus || lru_cpu_needs_drain(cpu)) {
+ lru_cache_lock_cpu(&lru_pvecs.locks, cpu);
+ lru_add_drain_cpu(cpu);
+ lru_cache_unlock_cpu(&lru_pvecs.locks, cpu);
+ }
+ }
+ } else {
+ cpumask_clear(&has_work);
+ for_each_online_cpu(cpu) {
+ struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+
+ if (force_all_cpus || lru_cpu_needs_drain(cpu)) {
+ INIT_WORK(work, lru_add_drain_per_cpu);
+ queue_work_on(cpu, mm_percpu_wq, work);
+ __cpumask_set_cpu(cpu, &has_work);
+ }
}
- }

- for_each_cpu(cpu, &has_work)
- flush_work(&per_cpu(lru_add_drain_work, cpu));
+ for_each_cpu(cpu, &has_work)
+ flush_work(&per_cpu(lru_add_drain_work, cpu));
+ }

done:
mutex_unlock(&lock);
--
2.31.1

2021-09-21 16:17:18

by Nicolas Saenz Julienne

Subject: [PATCH 5/6] mm/page_alloc: Allow remote per-cpu page list draining

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by using a more
constraining locking scheme, which is disabled by default and can be
enabled through the 'remote_pcpu_cache_access' static key. Regular users
shouldn't see any functional or performance change. Upcoming patches
will make use of the static key.

This is based on previous work by Thomas Gleixner, Anna-Maria Gleixner,
and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/page_alloc.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3244eb2ab51b..717df675ea06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3268,15 +3268,25 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
cpumask_clear_cpu(cpu, &cpus_with_pcps);
}

- for_each_cpu(cpu, &cpus_with_pcps) {
- struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ for_each_cpu(cpu, &cpus_with_pcps) {
+ if (zone) {
+ drain_pages_zone(cpu, zone);
+ } else {
+ drain_pages(cpu);
+ }
+ }
+ } else {
+ for_each_cpu(cpu, &cpus_with_pcps) {
+ struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

- drain->zone = zone;
- INIT_WORK(&drain->work, drain_local_pages_wq);
- queue_work_on(cpu, mm_percpu_wq, &drain->work);
+ drain->zone = zone;
+ INIT_WORK(&drain->work, drain_local_pages_wq);
+ queue_work_on(cpu, mm_percpu_wq, &drain->work);
+ }
+ for_each_cpu(cpu, &cpus_with_pcps)
+ flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
}
- for_each_cpu(cpu, &cpus_with_pcps)
- flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);

mutex_unlock(&pcpu_drain_mutex);
}
--
2.31.1

2021-09-21 16:17:27

by Nicolas Saenz Julienne

Subject: [PATCH 6/6] sched/isolation: Enable 'remote_pcpu_cache_access' on NOHZ_FULL systems

When enabled, 'remote_pcpu_cache_access' allows for remote draining of
mm/swap.c's per-cpu LRU caches and mm/page_alloc.c's per-cpu page lists
instead of using per-cpu drain work. This comes at the cost of more
constraining locking, but NOHZ_FULL setups need this nonetheless:
processes running on isolated CPUs are sensitive to any sort of
interruption and preempting them in order to satisfy a housekeeping task
is bound to break their functional guarantees (latency, bandwidth, etc.).

So enable 'remote_pcpu_cache_access' after having successfully
initialized NOHZ_FULL.

This is based on previous work by Thomas Gleixner, Anna-Maria Gleixner,
and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
kernel/sched/isolation.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 7f06eaf12818..4fc4c76f27ab 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,7 @@
*
*/
#include "sched.h"
+#include "../../mm/internal.h"

DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
EXPORT_SYMBOL_GPL(housekeeping_overridden);
@@ -137,11 +138,17 @@ static int __init housekeeping_setup(char *str, enum hk_flags flags)
static int __init housekeeping_nohz_full_setup(char *str)
{
unsigned int flags;
+ int ret;

flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
HK_FLAG_MISC | HK_FLAG_KTHREAD;

- return housekeeping_setup(str, flags);
+ ret = housekeeping_setup(str, flags);
+ if (ret)
+ /* Avoid LRU cache and mm/page_alloc.c's pcplists drain work */
+ static_branch_enable(&remote_pcpu_cache_access);
+
+ return ret;
}
__setup("nohz_full=", housekeeping_nohz_full_setup);

--
2.31.1

2021-09-21 16:19:00

by Nicolas Saenz Julienne

Subject: [PATCH 4/6] mm/page_alloc: Introduce alternative per-cpu list locking

page_alloc.c's per-cpu page lists are currently protected using local
locks. While good for performance, this doesn't allow for remote access to
these structures. CPUs requiring system-wide per-cpu list drains get
around this by scheduling drain work on all CPUs. That said, some
setups, like systems with NOHZ_FULL CPUs, aren't well suited to this, as
they can't handle interruptions of any sort.

To mitigate this, introduce an alternative locking scheme using
spinlocks that will permit remotely accessing these per-cpu page lists.
It's disabled by default, with no functional change to regular users,
and enabled through the 'remote_pcpu_cache_access' static key. Upcoming
patches will make use of this static key.

This is based on previous work by Thomas Gleixner, Anna-Maria Gleixner,
and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---
mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b610b05d9b8..3244eb2ab51b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -123,10 +123,12 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)

struct pagesets {
- local_lock_t lock;
+ local_lock_t local;
+ spinlock_t spin;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
- .lock = INIT_LOCAL_LOCK(lock),
+ .local = INIT_LOCAL_LOCK(pagesets.local),
+ .spin = __SPIN_LOCK_UNLOCKED(pagesets.spin),
};

#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
@@ -207,6 +209,52 @@ static int __init early_init_on_free(char *buf)
}
early_param("init_on_free", early_init_on_free);

+static inline void pagesets_lock_irqsave(struct pagesets *locks,
+ unsigned long *flagsp)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ /* Avoid migration between this_cpu_ptr() and spin_lock_irqsave() */
+ migrate_disable();
+ spin_lock_irqsave(this_cpu_ptr(&locks->spin), *flagsp);
+ } else {
+ local_lock_irqsave(&locks->local, *flagsp);
+ }
+}
+
+/*
+ * pagesets_lock_irqsave_cpu() should only be used from remote CPUs when
+ * 'remote_pcpu_cache_access' is enabled or the target CPU is dead. Otherwise,
+ * it can still be called on the local CPU with migration disabled.
+ */
+static inline void pagesets_lock_irqsave_cpu(struct pagesets *locks,
+ unsigned long *flagsp, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_lock_irqsave(per_cpu_ptr(&locks->spin, cpu), *flagsp);
+ else
+ local_lock_irqsave(&locks->local, *flagsp);
+}
+
+static inline void pagesets_unlock_irqrestore(struct pagesets *locks,
+ unsigned long flags)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ spin_unlock_irqrestore(this_cpu_ptr(&locks->spin), flags);
+ migrate_enable();
+ } else {
+ local_unlock_irqrestore(&locks->local, flags);
+ }
+}
+
+static inline void pagesets_unlock_irqrestore_cpu(struct pagesets *locks,
+ unsigned long flags, int cpu)
+{
+ if (static_branch_unlikely(&remote_pcpu_cache_access))
+ spin_unlock_irqrestore(per_cpu_ptr(&locks->spin, cpu), flags);
+ else
+ local_unlock_irqrestore(&locks->local, flags);
+}
+
/*
* A cached value of the page's pageblock's migratetype, used when the page is
* put on a pcplist. Used to avoid the pageblock migratetype lookup when
@@ -3064,12 +3112,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
unsigned long flags;
int to_drain, batch;

- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);
}
#endif

@@ -3077,21 +3125,22 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
* Drain pcplists of the indicated processor and zone.
*
* The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
+ * thread pinned to the current processor, a processor that
+ * is not online, or a remote processor while 'remote_pcpu_cache_access' is
+ * enabled.
*/
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
unsigned long flags;
struct per_cpu_pages *pcp;

- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave_cpu(&pagesets, &flags, cpu);

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);

- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore_cpu(&pagesets, flags, cpu);
}

/*
@@ -3402,9 +3451,9 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = MIGRATE_MOVABLE;
}

- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);
free_unref_page_commit(page, pfn, migratetype, order);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);
}

/*
@@ -3439,7 +3488,7 @@ void free_unref_page_list(struct list_head *list)
set_page_private(page, pfn);
}

- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);
list_for_each_entry_safe(page, next, list, lru) {
pfn = page_private(page);
set_page_private(page, 0);
@@ -3460,12 +3509,12 @@ void free_unref_page_list(struct list_head *list)
* a large list of pages to free.
*/
if (++batch_count == SWAP_CLUSTER_MAX) {
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);
batch_count = 0;
- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);
}
}
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);
}

/*
@@ -3639,7 +3688,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long flags;

- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);

/*
* On allocation, reduce the number of pages that are batch freed.
@@ -3650,7 +3699,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp->free_factor >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
zone_statistics(preferred_zone, zone, 1);
@@ -5270,7 +5319,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed;

/* Attempt the batch allocation */
- local_lock_irqsave(&pagesets.lock, flags);
+ pagesets_lock_irqsave(&pagesets, &flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];

@@ -5300,7 +5349,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);

__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5309,7 +5358,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
return nr_populated;

failed_irq:
- local_unlock_irqrestore(&pagesets.lock, flags);
+ pagesets_unlock_irqrestore(&pagesets, flags);

failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
--
2.31.1

2021-09-21 17:56:42

by Andrew Morton

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

On Tue, 21 Sep 2021 18:13:18 +0200 Nicolas Saenz Julienne <[email protected]> wrote:

> This series introduces an alternative locking scheme around mm/swap.c's per-cpu
> LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which allows
> remote CPUs to drain them. Currently, only the local CPU is permitted to
> change its per-cpu lists, and it is expected to do so on demand, whenever a
> process requests it (by queueing a drain task on the local CPU). Most
> systems handle this promptly, but it causes problems for NOHZ_FULL CPUs,
> which can't take any sort of interruption without breaking their functional
> guarantees (latency, bandwidth, etc.). Having a way for these processes to
> remotely drain the lists themselves makes co-existing with isolated CPUs
> possible, at the cost of more constraining locks.
>
> Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
> drain code are conditional on a static key which is disabled by default. This
> keeps functional and performance regressions to a minimum. The feature is only
> enabled once NOHZ_FULL has been successfully initialized.

That all looks pretty straightforward. Obvious problems are:

- Very little test coverage for the spinlocked code paths. Virtually
all test setups will be using local_lock() and the code path you care
about will go untested.

I hope that whoever does test the spinlock version will be running
full debug kernels, including lockdep. Because adding a spinlock
where the rest of the code expects local_lock might introduce
problems.

A fix for all of this would be to enable the spin_lock code paths
to be tested more widely. Perhaps you could add a boot-time kernel
parameter (or, not as good, a Kconfig thing) which forces the use of
the spinlock code even on non-NOHZ_FULL systems.

Or perhaps this debug/testing mode _should_ be enabled by Kconfig,
so kernel fuzzers sometimes turn it on.

Please have a think about all of this?

- Maintainability. Few other MM developers will think about this new
spinlocked mode much, and they are unlikely to runtime test the
spinlock mode. Adding the force-spinlocks-mode-on knob will help
with this.
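
A rough sketch of such a force-spinlocks knob (the parameter name is made
up, and it would need to live somewhere the static key is visible, e.g.
mm/swap.c):

/*
 * Hypothetical debug knob: force the spinlock-based locking scheme even on
 * !NOHZ_FULL systems so the remote-drain paths get wider test coverage.
 */
static bool force_remote_cache_access __initdata;

static int __init force_remote_cache_setup(char *str)
{
	force_remote_cache_access = true;
	return 0;
}
early_param("force_remote_pcpu_cache_access", force_remote_cache_setup);

static int __init remote_cache_debug_init(void)
{
	/* flip the static key once jump labels have been initialized */
	if (force_remote_cache_access)
		static_branch_enable(&remote_pcpu_cache_access);
	return 0;
}
early_initcall(remote_cache_debug_init);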


2021-09-21 18:04:15

by Vlastimil Babka

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

On 9/21/21 6:13 PM, Nicolas Saenz Julienne wrote:
> This series introduces an alternative locking scheme around mm/swap.c's per-cpu
> LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which allows
> remote CPUs to drain them. Currently, only the local CPU is permitted to
> change its per-cpu lists, and it is expected to do so on demand, whenever a
> process requests it (by queueing a drain task on the local CPU). Most
> systems handle this promptly, but it causes problems for NOHZ_FULL CPUs,
> which can't take any sort of interruption without breaking their functional
> guarantees (latency, bandwidth, etc.). Having a way for these processes to
> remotely drain the lists themselves makes co-existing with isolated CPUs
> possible, at the cost of more constraining locks.
>
> Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
> drain code are conditional on a static key which is disabled by default. This
> keeps functional and performance regressions to a minimum. The feature is only
> enabled once NOHZ_FULL has been successfully initialized.
>
> This work is based on a previous series by Thomas Gleixner, Anna-Maria
> Gleixner, and Sebastian Andrzej Siewior[1].

These days the pcplist protection is done by local_lock, which solved
the RT concerns. Probably a stupid/infeasible idea, but maybe what you
want to achieve could be more generally solved at the local_lock level?
That on NOHZ_FULL CPUs, local_locks could have this mode where they
could synchronize with remote cpus?

> [1] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
> Nicolas Saenz Julienne (6):
> mm/swap: Introduce lru_cpu_needs_drain()
> mm/swap: Introduce alternative per-cpu LRU cache locking
> mm/swap: Allow remote LRU cache draining
> mm/page_alloc: Introduce alternative per-cpu list locking
> mm/page_alloc: Allow remote per-cpu page list draining
> sched/isolation: Enable 'remote_pcpu_cache_access' on NOHZ_FULL
> systems
>
> kernel/sched/isolation.c | 9 +-
> mm/internal.h | 2 +
> mm/page_alloc.c | 111 ++++++++++++++++-----
> mm/swap.c | 202 ++++++++++++++++++++++++++++++---------
> 4 files changed, 253 insertions(+), 71 deletions(-)
>

2021-09-22 00:11:28

by Peter Zijlstra

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> +{
> + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> + /* Avoid migration between this_cpu_ptr() and spin_lock() */
> + migrate_disable();
> + spin_lock(this_cpu_ptr(&locks->spin));
> + } else {
> + local_lock(&locks->local);
> + }
> +}

> +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> +{
> + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> + spin_unlock(this_cpu_ptr(&locks->spin));
> + migrate_enable();
> + } else {
> + local_unlock(&locks->local);
> + }
> +}

*why* use migrate_disable(), that's horrible!

2021-09-22 08:48:45

by Nicolas Saenz Julienne

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > +{
> > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > + /* Avoid migration between this_cpu_ptr() and spin_lock() */
> > + migrate_disable();
> > + spin_lock(this_cpu_ptr(&locks->spin));
> > + } else {
> > + local_lock(&locks->local);
> > + }
> > +}
>
> > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > +{
> > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > + spin_unlock(this_cpu_ptr(&locks->spin));
> > + migrate_enable();
> > + } else {
> > + local_unlock(&locks->local);
> > + }
> > +}
>
> *why* use migrate_disable(), that's horrible!

I was trying to be mindful of RT. They don't appreciate people taking spinlocks
just after having disabled preemption.

I think getting local_lock(&locks->local) is my only option then. But it adds
an extra redundant spinlock in the RT+NOHZ_FULL case.

--
Nicolás Sáenz

by Sebastian Andrzej Siewior

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On 2021-09-22 10:47:07 [+0200], [email protected] wrote:
> > *why* use migrate_disable(), that's horrible!
>
> I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> just after having disabled preemption.
>
> I think getting local_lock(&locks->local) is my only option then. But it adds
> an extra redundant spinlock in the RT+NOHZ_FULL case.

spin_lock() does not disable preemption on PREEMPT_RT. You don't
disable preemption on purpose, or did I miss that?

Sebastian

2021-09-22 09:52:02

by Nicolas Saenz Julienne

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On Wed, 2021-09-22 at 11:20 +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-22 10:47:07 [+0200], [email protected] wrote:
> > > *why* use migrate_disable(), that's horrible!
> >
> > I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> > just after having disabled preemption.
> >
> > I think getting local_lock(&locks->local) is my only option then. But it adds
> > an extra redundant spinlock in the RT+NOHZ_FULL case.
>
> spin_lock() does not disable preemption on PREEMPT_RT. You don't
> disable preemption on purpose, or did I miss that?

Sorry my message wasn't clear. Adding code for context:

+ static inline void lru_cache_lock(struct lru_cache_locks *locks)
+ {
+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ /* Avoid migration between this_cpu_ptr() and spin_lock() */
+ migrate_disable();

IIUC PeterZ would've preferred that I disable preemption here. What I meant
to explain is that, given the spinlock below, I chose migrate_disable() over
preempt_disable() to cater for RT.

+ spin_lock(this_cpu_ptr(&locks->spin));
+ } else {


So, to make both worlds happy, I think the only option left is using the
local_lock (at the expense of extra overhead in the RT+NOHZ_FULL case):

+ if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ /* Local lock avoids migration between this_cpu_ptr() and spin_lock() */
+ local_lock(&locks->local);
+ spin_lock(this_cpu_ptr(&locks->spin));
+ } else {

--
Nicolás Sáenz

2021-09-22 11:30:00

by Peter Zijlstra

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:

> These days the pcplist protection is done by local_lock, which solved
> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
> want to achieve could be more generally solved at the local_lock level?
> That on NOHZ_FULL CPUs, local_locks could have this mode where they
> could synchronize with remote cpus?

local_lock and spinlock have different rules, local_lock for example can
never cause an irq inversion, unlike a spinlock.

2021-09-22 11:39:02

by Peter Zijlstra

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On Wed, Sep 22, 2021 at 10:47:07AM +0200, [email protected] wrote:
> On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> > On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > > +{
> > > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > + /* Avoid migration between this_cpu_ptr() and spin_lock() */
> > > + migrate_disable();
> > > + spin_lock(this_cpu_ptr(&locks->spin));
> > > + } else {
> > > + local_lock(&locks->local);
> > > + }
> > > +}
> >
> > > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > > +{
> > > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > + spin_unlock(this_cpu_ptr(&locks->spin));
> > > + migrate_enable();
> > > + } else {
> > > + local_unlock(&locks->local);
> > > + }
> > > +}
> >
> > *why* use migrate_disable(), that's horrible!
>
> I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> just after having disabled preemption.
>
> I think getting local_lock(&locks->local) is my only option then. But it adds
> an extra redundant spinlock in the RT+NOHZ_FULL case.

That doesn't make it less horrible. The fundamental problem you seem to
have is that you have to do the this_cpu thing multiple times.

If instead you make sure to only ever do the per-cpu deref *once* and
then make sure you use the same lru_rotate.pvec as you used
lru_rotate.locks, it all works out much nicer.

Then you can write things like:

struct lru_rotate *lr = raw_cpu_ptr(&lru_rotate);

frob_lock(&lr->locks);
frob_pvec(&lr->pvec);
frob_unlock(&lr->locks);

and it all no longer matters if you got this or that CPU's instance.

After all, you no longer rely on per-cpu ness for serialization but the
lock.
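
Concretely, for the spinlock mode inside e.g. folio_rotate_reclaimable()
that would look something like this (sketch only, using the names from
patch 2):

struct lru_rotate *lr = raw_cpu_ptr(&lru_rotate);
unsigned long flags;

/* whichever CPU's instance we got, its lock covers both fields */
spin_lock_irqsave(&lr->locks.spin, flags);
if (pagevec_add_and_need_flush(&lr->pvec, &folio->page))
	pagevec_lru_move_fn(&lr->pvec, pagevec_move_tail_fn);
spin_unlock_irqrestore(&lr->locks.spin, flags);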

2021-09-22 11:44:55

by Nicolas Saenz Julienne

Subject: Re: [PATCH 2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

On Wed, 2021-09-22 at 13:37 +0200, Peter Zijlstra wrote:
> On Wed, Sep 22, 2021 at 10:47:07AM +0200, [email protected] wrote:
> > On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > > > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > > > +{
> > > > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > > + /* Avoid migration between this_cpu_ptr() and spin_lock() */
> > > > + migrate_disable();
> > > > + spin_lock(this_cpu_ptr(&locks->spin));
> > > > + } else {
> > > > + local_lock(&locks->local);
> > > > + }
> > > > +}
> > >
> > > > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > > > +{
> > > > + if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > > + spin_unlock(this_cpu_ptr(&locks->spin));
> > > > + migrate_enable();
> > > > + } else {
> > > > + local_unlock(&locks->local);
> > > > + }
> > > > +}
> > >
> > > *why* use migrate_disable(), that's horrible!
> >
> > I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> > just after having disabled preemption.
> >
> > I think getting local_lock(&locks->local) is my only option then. But it adds
> > an extra redundant spinlock in the RT+NOHZ_FULL case.
>
> That doesn't make it less horrible. The fundamental problem you seem to
> have is that you have to do the this_cpu thing multiple times.
>
> If instead you make sure to only ever do the per-cpu deref *once* and
> then make sure you use the same lru_rotate.pvec as you used
> lru_rotate.locks, it all works out much nicer.
>
> Then you can write things like:
>
> struct lru_rotate *lr = raw_cpu_ptr(&lru_rotate);
>
> frob_lock(&lr->locks);
> frob_pvec(&lr->pvec);
> frob_unlock(&lr->locks);
>
> and it all no longer matters if you got this or that CPU's instance.
>
> After all, you no longer rely on per-cpu ness for serialization but the
> lock.

Thanks, understood. I'll look into it.

--
Nicolás Sáenz

2021-09-22 22:12:40

by Thomas Gleixner

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
>
>> These days the pcplist protection is done by local_lock, which solved
>> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
>> want to achieve could be more generally solved at the local_lock level?
>> That on NOHZ_FULL CPUs, local_locks could have this mode where they
>> could synchronize with remote cpus?
>
> local_lock and spinlock have different rules, local_lock for example can
> never cause an irq inversion, unlike a spinlock.

TBH, I really regret an attempt I made at some point in the RT
development to abuse local locks for this kind of cross CPU protections
because that led to yet another semantically ill defined construct.

local locks are as the name says strictly local. IOW, they do not exist
for remote access. Just look at the !RT mapping:

local_lock() -> preempt_disable()
local_lock_irq() -> local_irq_disable()
...

The only thing local_lock is addressing is the opaque nature of
preempt_disable(), local*_disable/save() protected critical sections,
which have per CPU BKL, i.e. undefined protection scope, semantics.

If you really want cross CPU protection then using a regular spinlock in
a per CPU data structure is the right way to go.

That makes it a bit awkward vs. the code at hand which already introduced
local locks to replace the opaque preempt/local_irq disabled critical
sections with scoped local locks which in turn allows RT to substitute
them with strict CPU local spinlocks.

But for clarity sake we really have to look at two different cases now:

1) Strict per CPU local protection

That's what the code does today via local lock and this includes
RT where the locality is preserved via the local lock semantics

I.e. for the !RT case the spinlock overhead is avoided

2) Global scoped per CPU protection

That's what Nicolas is trying to achieve to be able to update
data structures cross CPU for the sake of avoiding smp function
calls or queuing work on remote CPUs for the NOHZ_FULL isolation
use case.

That said, I completely understand Andrew's concerns versus these
distinctions and their test coverage.

In consequence the real interesting question here is whether any of
these use cases are sensitive to the extra overhead of #2.

IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
uncontended per CPU spinlock unconditionally instead of just disabling
preemption / interrupts?

The same question arises vs. the remote work queuing. Does it really
matter? I think that a proper investigation is due to figure out whether
delegated work is really superior vs. doing the same work locked from a
remote CPU occasionally.

If you really think about it then the task which is initiating the work
is already in a slow path. So what's the tradeoff to make this path a
little bit slower due to remote access vs. scheduling work and thereby
disturbing the remote CPU which has a performance impact as well and in
the NOHZ_FULL case even a correctness impact? That's especially
questionable for situations where the initiator has to wait for
completion of the remote work.

The changelogs and the cover letter have a distinct void vs. that which
means this is just another example of 'scratch my itch' changes w/o
proper justification.

I'm well aware that the inital stuff which myself, and in consequence
Sebastian and Anna-Maria, were doing back then falls into the same
category. IOW, guilty as charged, but that does not make the current
approach more palatable than the one from years ago.

Thanks,

tglx

2021-09-23 07:14:11

by Vlastimil Babka

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

On 9/23/21 00:09, Thomas Gleixner wrote:
> On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
>> On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
>>
>>> These days the pcplist protection is done by local_lock, which solved
>>> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
>>> want to achieve could be more generally solved at the local_lock level?
>>> That on NOHZ_FULL CPUs, local_locks could have this mode where they
>>> could synchronize with remote cpus?
>>
>> local_lock and spinlock have different rules, local_lock for example can
>> never cause an irq inversion, unlike a spinlock.
>
> TBH, I really regret an attempt I made at some point in the RT
> development to abuse local locks for this kind of cross CPU protections
> because that led to yet another semantically ill defined construct.
>
> local locks are as the name says strictly local. IOW, they do not exist
> for remote access. Just look at the !RT mapping:
>
> local_lock() -> preempt_disable()
> local_lock_irq() -> local_irq_disable()
> ...

Yes, to be clean, this would have to be a new primitive, not just an abused
local lock. It would just look similar to the RT version (a percpu array of
spinlocks), so for this patchset it would allow not to have two such locks
side by side (local + spin) while only one is being used. For maximum
flexibility the initialization would take a CONFIG_ (or something
compile-time bool) that when false would make the !RT version an empty
struct and "locking" would rely on preempt/irq disable (just as with !RT
local_lock). If compile-time true it would take a static key to decide on
boot whether the !RT version only does the preempt/irq disable or actually
takes the lock.

But as you say below, it's too much complexity for questionable benefit.

But maybe this can all be avoided anyway, as I recalled what we do for
vmstat already (IIUC). See quiet_vmstat() - when a CPU enters nohz mode,
it flushes per-cpu vmstat diffs and then there's no reason to further
disturb the CPU to do that while it's in NOHZ mode. We could do the same for
lru pagevecs and pcplists?
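
Something along these lines, as a rough sketch (the hook placement is
illustrative, mirroring what quiet_vmstat() does when the tick is stopped):

static void quiet_mm_caches(void)
{
	lru_add_drain();		/* flush this CPU's LRU pagevecs */
	drain_local_pages(NULL);	/* flush this CPU's pcplists, all zones */
}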

> The only thing local_lock is addressing is the opaque nature of
> preempt_disable(), local*_disable/save() protected critical sections,
> which have per CPU BKL, i.e. undefined protection scope, semantics.
>
> If you really want cross CPU protection then using a regular spinlock in
> a per CPU data structure is the right way to go.
>
> That makes it a bit awkward vs. the code at hand which already introduced
> local locks to replace the opaque preempt/local_irq disabled critical
> sections with scoped local locks which in turn allows RT to substitute
> them with strict CPU local spinlocks.
>
> But for clarity sake we really have to look at two different cases now:
>
> 1) Strict per CPU local protection
>
> That's what the code does today via local lock and this includes
> RT where the locality is preserved via the local lock semantics
>
> I.e. for the !RT case the spinlock overhead is avoided
>
> 2) Global scoped per CPU protection
>
> That's what Nicolas is trying to achieve to be able to update
> data structures cross CPU for the sake of avoiding smp function
> calls or queuing work on remote CPUs for the NOHZ_FULL isolation
> use case.
>
> That said, I completely understand Andrew's concerns versus these
> distinctions and their test coverage.
>
> In consequence the real interesting question here is whether any of
> these use cases are sensitive to the extra overhead of #2.
>
> IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
> uncontended per CPU spinlock unconditionally instead of just disabling
> preemption / interrupts?
>
> The same question arises vs. the remote work queuing. Does it really
> matter? I think that a proper investigation is due to figure out whether
> delegated work is really superior vs. doing the same work locked from a
> remote CPU occasionally.
>
> If you really think about it then the task which is initiating the work
> is already in a slow path. So what's the tradeoff to make this path a
> little bit slower due to remote access vs. scheduling work and thereby
> disturbing the remote CPU which has a performance impact as well and in
> the NOHZ_FULL case even a correctness impact? That's especially
> questionable for situations where the initiator has to wait for
> completion of the remote work.
>
> The changelogs and the cover letter have a distinct void vs. that which
> means this is just another example of 'scratch my itch' changes w/o
> proper justification.
>
> I'm well aware that the initial stuff which myself, and in consequence
> Sebastian and Anna-Maria, were doing back then falls into the same
> category. IOW, guilty as charged, but that does not make the current
> approach more palatable than the one from years ago.
>
> Thanks,
>
> tglx
>

2021-09-23 10:40:47

by Thomas Gleixner

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

Vlastimil,

On Thu, Sep 23 2021 at 09:12, Vlastimil Babka wrote:
> On 9/23/21 00:09, Thomas Gleixner wrote:
>> local_lock() -> preempt_disable()
>> local_lock_irq() -> local_irq_disable()
>> ...
>
> Yes, to be clean, this would have to be a new primitive, not just an abused
> local lock. It would just look similar to the RT version (a percpu array of
> spinlocks), so for this patchset it would allow not to have two such locks
> side by side (local + spin) while only one is being used. For maximum
> flexibility the initialization would take a CONFIG_ (or something
> compile-time bool) that when false would make the !RT version an empty
> struct and "locking" would rely on preempt/irq disable (just as with !RT
> local_lock). If compile-time true it would take a static key to decide on
> boot whether the !RT version only does the preepmt/irq disable or actually
> takes the lock.
>
> But as you say below, it's too much complexity for questionable benefit.
>
> But maybe this can all be avoided anyway, as I recalled what we do for
> vmstat already (IIUC). See quiet_vmstat() - when a CPU enters nohz mode,
> it flushes per-cpu vmstat diffs and then there's no reason to further
> disturb the CPU to do that while it's in NOHZ mode. We could do the same for
> lru pagevecs and pcplists?

I'm not sure about this. I like the idea of being able to offload things
to housekeeping CPUs not only in the full isolation case.

A good example is RCU, which allows offloading all RCU processing to some
other CPU(s); that is useful even w/o full isolation.

The synchronous quiescing on entering NOHZ full mode is a cute
workaround, but for one it makes entering NOHZ full more expensive, and it
does not necessarily provide good isolation guarantees under all
circumstances, while full remote processing definitely does.

I think it's at least worthwhile to investigate.

Thanks,

tglx

2021-09-27 09:33:20

by Nicolas Saenz Julienne

Subject: Re: [PATCH 0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

Hi Thomas,
thanks for the in-depth review.

On Thu, 2021-09-23 at 00:09 +0200, Thomas Gleixner wrote:
> On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
> > On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
> >
> > > These days the pcplist protection is done by local_lock, which solved
> > > the RT concerns. Probably a stupid/infeasible idea, but maybe what you
> > > want to achieve could be more generally solved at the local_lock level?
> > > That on NOHZ_FULL CPUs, local_locks could have this mode where they
> > > could synchronize with remote cpus?
> >
> > local_lock and spinlock have different rules, local_lock for example can
> > never cause an irq inversion, unlike a spinlock.
>
> TBH, I really regret an attempt I made at some point in the RT
> development to abuse local locks for this kind of cross CPU protections
> because that led to yet another semantically ill defined construct.
>
> local locks are as the name says strictly local. IOW, they do not exist
> for remote access. Just look at the !RT mapping:
>
> local_lock() -> preempt_disable()
> local_lock_irq() -> local_irq_disable()
> ...
>
> The only thing local_lock is addressing is the opaque nature of
> preempt_disable(), local*_disable/save() protected critical sections,
> which have per CPU BKL, i.e. undefined protection scope, semantics.
>
> If you really want cross CPU protection then using a regular spinlock in
> a per CPU data structure is the right way to go.
>
> That makes it a bit awkward vs. the code at hand which already introduced
> local locks to replace the opaque preempt/local_irq disabled critical
> sections with scoped local locks which in turn allows RT to substitute
> them with strict CPU local spinlocks.
>
> But for clarity sake we really have to look at two different cases now:
>
> 1) Strict per CPU local protection
>
> That's what the code does today via local lock and this includes
> RT where the locality is preserved via the local lock semantics
>
> I.e. for the !RT case the spinlock overhead is avoided
>
> 2) Global scoped per CPU protection
>
> That's what Nicolas is trying to achieve to be able to update
> data structures cross CPU for the sake of avoiding smp function
> calls or queuing work on remote CPUs for the NOHZ_FULL isolation
> use case.
>
> That said, I completely understand Andrew's concerns versus these
> distinctions and their test coverage.
>
> In consequence the real interesting question here is whether any of
> these use cases are sensitive to the extra overhead of #2.
>
> IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
> uncontended per CPU spinlock unconditionally instead of just disabling
> preemption / interrupts?
>
> The same question arises vs. the remote work queuing. Does it really
> matter? I think that a proper investigation is due to figure out whether
> delegated work is really superior vs. doing the same work locked from a
> remote CPU occasionally.
>
> If you really think about it then the task which is initiating the work
> is already in a slow path. So what's the tradeoff to make this path a
> little bit slower due to remote access vs. scheduling work and thereby
> disturbing the remote CPU which has a performance impact as well and in
> the NOHZ_FULL case even a correctness impact? That's especially
> questionable for situations where the initiator has to wait for
> completion of the remote work.
>
> The changelogs and the cover letter have a distinct void vs. that which
> means this is just another example of 'scratch my itch' changes w/o
> proper justification.

Point taken, I'll get to it.

--
Nicolás Sáenz