The problem:
We have a few scenarios in mm that make use of local_locks() +
schedule_work_on(), since this keeps the overhead low for the most common,
local accesses. This pattern is not ideal for RT workloads though: even on
isolated cpus, the queued work items will schedule out the sensitive RT
workload in order to run, and this usually causes missed deadlines.

The idea:
In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should
be no harm in simply acquiring another cpu's spinlock to perform the work
on its per-cpu structure: the cacheline invalidation would already happen
with schedule_work_on(), and the local users already take the lock anyway.

This avoids schedule_work_on(), and thus avoids scheduling out an RT
workload. Given the usually brief nature of the scheduled work, executing
it directly ends up being faster than scheduling it.
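
As a minimal sketch of the resulting pattern (my_pcp, my_drain and
my_drain_work are made-up names standing in for the real per-cpu
structures converted in patches 2-4; my_pcp is a DEFINE_PER_CPU struct
with a local_lock_t 'lock' member):

	/* Today (all kernels): the drain preempts whatever runs on @cpu:
	 *	schedule_work_on(cpu, &per_cpu(my_drain_work, cpu));
	 *
	 * With this series, on PREEMPT_RT the requesting cpu does it in
	 * place, serialized by @cpu's local_lock (a spinlock on RT):
	 */
	static void my_drain_remote(int cpu)
	{
		struct my_pcp *p;

		local_lock_n(&my_pcp.lock, cpu);
		p = per_cpu_ptr(&my_pcp, cpu);
		my_drain(p);
		local_unlock_n(&my_pcp.lock, cpu);
	}
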
======
While I really believe in this solution, there are problems with this
patchset for which I need your suggestions:

1) Naming is horrible: I could not think of a good name for the new lock
functions, so I lazily named them local_lock_n().
The name should make clear that we are in fact dealing with a local_lock,
but that in some scenarios it can be addressing another cpu's local_lock,
and thus needs the extra cpu parameter.

Trying to capture this local & remote duality, all I came up with was:
mostly_local_lock(), (or local_mostly_lock())
local_maybe_remote_lock() <- LOL
remote_local_lock()
per_cpu_local_lock()
local_lock_cpu()
Please help !
2) Maybe naming is this hard because the interface is not the best.
My idea was to create a simple interface that easily replaces the functions
already in use, but maybe there is something better that I could not see.
Please contribute!
3) I am lazily setting work->data without atomic operations, which may
be bad in some scenarios. If so, it should be no problem to add the atomic
operation for non-RT kernels: even though it can be costly, it happens
outside of the hotpath (the local per-cpu accesses).
For RT kernels, since the whole operation happens locally, there should be
no need for the atomic set.
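
For reference, an atomic variant of the !PREEMPT_RT queueing helper could
look like the sketch below. It only addresses the atomicity of the store;
the cpu number still overwrites the flag bits kept in work->data, as
discussed above:

	#define __local_queue_work_on(cpu, wq, work)			\
		do {							\
			typeof(cpu) __cpu = cpu;			\
			typeof(work) __work = work;			\
			/* atomic store of the target cpu */		\
			atomic_long_set(&__work->data, __cpu);		\
			queue_work_on(__cpu, wq, __work);		\
		} while (0)
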
Please let me know of any ideas or suggestions that can improve this RFC.
Thanks a lot!
Leo
Leonardo Bras (4):
Introducing local_lock_n() and local queue & flush
  swap: apply new local_queue_work_on() interface
  memcontrol: apply new local_queue_work_on() interface
  slub: apply new local_queue_work_on() interface
include/linux/local_lock.h | 18 ++++++++++
include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
mm/memcontrol.c | 17 ++++++----
mm/slub.c | 17 ++++++----
mm/swap.c | 18 +++++-----
5 files changed, 100 insertions(+), 22 deletions(-)
--
2.41.0
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while the rare remote
operations are scheduled on the target cpu. This keeps the overhead low,
since the cacheline tends to stay local (and there is no locking at all in
non-RT kernels), and only the few remote operations pay the extra cost of
scheduling.
On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with some unrelated task is
sure to introduce unexpected deadline misses.
It's interesting, though, that local_lock()s in RT kernels become a
spinlock(), so we can use this locking cost (which is already being paid)
to avoid scheduling work on a remote cpu, and instead update another cpu's
per_cpu structure from the current cpu while holding its spinlock().
In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's local lock (local_*lock_n*()), and
also the corresponding local_queue_work_on() and local_flush_work()
helpers.
On non-RT kernels, every local_*_n*() variant works exactly the same as its
non-n counterpart (the extra parameter is ignored), and both
local_queue_work_on() and local_flush_work() call their non-local versions.
For RT kernels, though, local_*lock_n*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.
local_queue_work_on() will just call the requested work function on the
current cpu: since the local_locks() are spinlocks(), this is safe.
local_flush_work() then becomes a no-op since no work is actually scheduled
on a remote cpu.
Some minimal code rework is needed in order to make this mechanism work:
the local_*lock*() calls in the functions that are currently scheduled on
remote cpus need to be replaced by local_*lock_n*(), so that in RT kernels
they can reference a different cpu.

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls become per_cpu_ptr(..., smp_processor_id()).
On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.
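
As an illustration, the resulting usage pattern looks roughly like the
sketch below (my_pcp / my_drain* are hypothetical names; the real
conversions follow in the next patches):

	struct my_pcp {
		local_lock_t lock;
		/* ... per-cpu state ... */
	};
	static DEFINE_PER_CPU(struct my_pcp, my_pcp) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};
	static DEFINE_PER_CPU(struct work_struct, my_drain_work);

	static void my_drain(int cpu)
	{
		local_lock_n(&my_pcp.lock, cpu);
		/* ... work on per_cpu_ptr(&my_pcp, cpu) ... */
		local_unlock_n(&my_pcp.lock, cpu);
	}

	/* work callback: the target cpu was stashed in work->data */
	static void my_drain_workfn(struct work_struct *w)
	{
		my_drain(w->data.counter);
	}

	static void my_drain_all(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct work_struct *work = &per_cpu(my_drain_work, cpu);

			INIT_WORK(work, my_drain_workfn);
			/* !RT: queue_work_on(cpu, ...); RT: runs my_drain(cpu) here */
			local_queue_work_on(cpu, system_wq, work);
		}

		for_each_online_cpu(cpu)
			local_flush_work(&per_cpu(my_drain_work, cpu)); /* no-op on RT */
	}
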
Signed-off-by: Leonardo Bras <[email protected]>
---
include/linux/local_lock.h | 18 ++++++++++
include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa7329..f1fa1e8e3fbc 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,22 @@
#define local_unlock_irqrestore(lock, flags) \
__local_unlock_irqrestore(lock, flags)
+#define local_lock_n(lock, cpu) \
+ __local_lock_n(lock, cpu)
+
+#define local_unlock_n(lock, cpu) \
+ __local_unlock_n(lock, cpu)
+
+#define local_lock_irqsave_n(lock, flags, cpu) \
+ __local_lock_irqsave_n(lock, flags, cpu)
+
+#define local_unlock_irqrestore_n(lock, flags, cpu) \
+ __local_unlock_irqrestore_n(lock, flags, cpu)
+
+#define local_queue_work_on(cpu, wq, work) \
+ __local_queue_work_on(cpu, wq, work)
+
+#define local_flush_work(work) \
+ __local_flush_work(work)
+
#endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a7..df064149fff8 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -98,6 +98,25 @@ do { \
local_irq_restore(flags); \
} while (0)
+#define __local_lock_n(lock, cpu) __local_lock(lock)
+#define __local_unlock_n(lock, cpu) __local_unlock(lock)
+
+#define __local_lock_irqsave_n(lock, flags, cpu) \
+ __local_lock_irqsave(lock, flags)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu) \
+ __local_unlock_irqrestore(lock, flags)
+
+#define __local_queue_work_on(cpu, wq, work) \
+ do { \
+ typeof(cpu) __cpu = cpu; \
+ typeof(work) __work = work; \
+ __work->data.counter = __cpu; \
+ queue_work_on(__cpu, wq, __work); \
+ } while (0)
+
+#define __local_flush_work(work) flush_work(work)
+
#else /* !CONFIG_PREEMPT_RT */
/*
@@ -138,4 +157,37 @@ typedef spinlock_t local_lock_t;
#define __local_unlock_irqrestore(lock, flags) __local_unlock(lock)
+#define __local_lock_n(__lock, cpu) \
+ do { \
+ migrate_disable(); \
+		spin_lock(per_cpu_ptr((__lock), cpu));		\
+ } while (0)
+
+#define __local_unlock_n(__lock, cpu) \
+ do { \
+		spin_unlock(per_cpu_ptr((__lock), cpu));	\
+ migrate_enable(); \
+ } while (0)
+
+#define __local_lock_irqsave_n(lock, flags, cpu) \
+ do { \
+ typecheck(unsigned long, flags); \
+ flags = 0; \
+ __local_lock_n(lock, cpu); \
+ } while (0)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu) \
+ __local_unlock_n(lock, cpu)
+
+#define __local_queue_work_on(cpu, wq, work) \
+ do { \
+ typeof(cpu) __cpu = cpu; \
+ typeof(work) __work = work; \
+		__work->data.counter = __cpu;			\
+ __work->func(__work); \
+ } while (0)
+
+#define __local_flush_work(work) \
+ do {} while (0)
+
#endif /* CONFIG_PREEMPT_RT */
--
2.41.0
Make use of the new local_*lock_n*() and local_queue_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_*lock*() with local_*lock_n*(), and replace queue_work_on() with
local_queue_work_on(). The same goes for flush_work(), which is replaced
with local_flush_work().

This should bring no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different cpu, the local_*lock*'s
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
Signed-off-by: Leonardo Bras <[email protected]>
---
mm/swap.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..a79f2091eae5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -760,11 +760,11 @@ void lru_add_drain(void)
* the same cpu. It shouldn't be a problem in !SMP case since
* the core is only one and the locks will disable preemption.
*/
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_and_bh_lrus_drain(int cpu)
{
- local_lock(&cpu_fbatches.lock);
- lru_add_drain_cpu(smp_processor_id());
- local_unlock(&cpu_fbatches.lock);
+ local_lock_n(&cpu_fbatches.lock, cpu);
+ lru_add_drain_cpu(cpu);
+ local_unlock_n(&cpu_fbatches.lock, cpu);
invalidate_bh_lrus_cpu();
mlock_drain_local();
}
@@ -782,9 +782,9 @@ void lru_add_drain_cpu_zone(struct zone *zone)
static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
{
- lru_add_and_bh_lrus_drain();
+ lru_add_and_bh_lrus_drain(w->data.counter);
}
static bool cpu_needs_drain(unsigned int cpu)
@@ -888,13 +888,13 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
if (cpu_needs_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
- queue_work_on(cpu, mm_percpu_wq, work);
+ local_queue_work_on(cpu, mm_percpu_wq, work);
__cpumask_set_cpu(cpu, &has_work);
}
}
for_each_cpu(cpu, &has_work)
- flush_work(&per_cpu(lru_add_drain_work, cpu));
+ local_flush_work(&per_cpu(lru_add_drain_work, cpu));
done:
mutex_unlock(&lock);
@@ -941,7 +941,7 @@ void lru_cache_disable(void)
#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
- lru_add_and_bh_lrus_drain();
+ lru_add_and_bh_lrus_drain(smp_processor_id());
#endif
}
--
2.41.0
Make use of the new local_*lock_n*() and local_queue_work_on() interface
to improve performance & latency on PREEMPT_RT kernels.

For functions that may be scheduled on a different cpu, replace
local_*lock*() with local_*lock_n*(), and replace schedule_work_on() with
local_queue_work_on().

This should bring no relevant performance impact on non-RT kernels:
for functions that may be scheduled on a different cpu, the local_*lock*'s
this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
Signed-off-by: Leonardo Bras <[email protected]>
---
mm/memcontrol.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..6d4fa48d75e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2277,7 +2277,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
WRITE_ONCE(stock->cached, NULL);
}
-static void drain_local_stock(struct work_struct *dummy)
+static void _drain_local_stock(int cpu)
{
struct memcg_stock_pcp *stock;
struct obj_cgroup *old = NULL;
@@ -2288,18 +2288,23 @@ static void drain_local_stock(struct work_struct *dummy)
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
- local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave_n(&memcg_stock.stock_lock, flags, cpu);
- stock = this_cpu_ptr(&memcg_stock);
+ stock = per_cpu_ptr(&memcg_stock, cpu);
old = drain_obj_stock(stock);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
- local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore_n(&memcg_stock.stock_lock, flags, cpu);
if (old)
obj_cgroup_put(old);
}
+static void drain_local_stock(struct work_struct *w)
+{
+ _drain_local_stock((int)w->data.counter);
+}
+
/*
* Cache charges(val) to local per_cpu area.
* This will be consumed by consume_stock() function, later.
@@ -2365,9 +2370,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
if (flush &&
!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
if (cpu == curcpu)
- drain_local_stock(&stock->work);
+ _drain_local_stock(cpu);
else if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, &stock->work);
+ local_queue_work_on(cpu, system_wq, &stock->work);
}
}
migrate_enable();
--
2.41.0
CC: Peter Xu
On Sat, Jul 29, 2023 at 5:38 AM Leonardo Bras <[email protected]> wrote:
>
> [ full cover letter quoted above; snipped ]
>
On Sat, Jul 29, 2023 at 05:37:33AM -0300, Leonardo Bras wrote:
> Make use of the new local_*lock_n*() and local_queue_work_on() interface
> to improve performance & latency on PREEMPT_RT kernels.
>
> For functions that may be scheduled on a different cpu, replace
> local_*lock*() with local_*lock_n*(), and replace queue_work_on() with
> local_queue_work_on(). The same goes for flush_work(), which is replaced
> with local_flush_work().
>
> This should bring no relevant performance impact on non-RT kernels:
> for functions that may be scheduled on a different cpu, the local_*lock*'s
> this_cpu_ptr() becomes per_cpu_ptr(..., smp_processor_id()).
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> mm/swap.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
Leo,
I think the interruptions should rather be removed for both
CONFIG_PREEMPT_RT AND !CONFIG_PREEMPT_RT.
The impact of grabbing locks must be properly analyzed and not
"rejected blindly".
Example:
commit 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8
Author: Mel Gorman <[email protected]>
Date: Fri Jun 24 13:54:23 2022 +0100
mm/page_alloc: replace local_lock with normal spinlock
struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure CPU
local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst)
local_lock protects against migration between when the percpu pointer is
accessed and the pcp->lock acquired. The lock acquisition is a preemption
point so in the worst case, a task could migrate to another NUMA node and
accidentally allocate remote memory. The main requirement is to pin the
task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
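
For context, a heavily simplified sketch of the shape that commit moves to
(not the actual page allocator code) would be:

	struct my_pcp {
		spinlock_t lock;	/* protects the list, local or remote */
		struct list_head list;
	};
	static DEFINE_PER_CPU(struct my_pcp, my_pcp);	/* spin_lock_init() etc. at boot */

	/* hot path: stay pinned to this cpu and take its lock */
	static void my_local_op(void)
	{
		struct my_pcp *p;

		migrate_disable();
		p = this_cpu_ptr(&my_pcp);
		spin_lock(&p->lock);
		/* ... operate on p->list ... */
		spin_unlock(&p->lock);
		migrate_enable();
	}

	/* remote drain: no work item at all, just take the target cpu's lock */
	static void my_remote_drain(int cpu)
	{
		struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);

		spin_lock(&p->lock);
		/* ... drain p->list ... */
		spin_unlock(&p->lock);
	}

That is, both RT and !RT kernels pay for the same plain spinlock, and the
remote drain never needs to interrupt the target cpu.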