Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
From:   Leonardo Bras <leobras@redhat.com>
To:     Thomas Gleixner <tglx@linutronix.de>,
        Marcelo Tosatti <mtosatti@redhat.com>,
        linux-kernel@vger.kernel.org
Cc:     Leonardo Bras <leobras@redhat.com>
Subject: [RFC PATCH 1/4] Introducing local_lock_n() and local queue & flush
Date:   Sat, 29 Jul 2023 05:37:32 -0300
Message-ID: <20230729083737.38699-3-leobras@redhat.com>
In-Reply-To: <20230729083737.38699-2-leobras@redhat.com>
References: <20230729083737.38699-2-leobras@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, with the rare remote
operations being scheduled on the target cpu. This keeps the overhead low,
since the cachelines tend to stay mostly local (and no locks are taken in
non-RT kernels), while the few remote operations are more costly due to
scheduling.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with some unrelated task is
sure to introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock()s, so we can use this locking cost (that is already being paid)
to avoid scheduling work on a remote cpu, and instead update another cpu's
per_cpu structure from the current cpu while holding its spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's local lock (local_*lock_n*()), and
also the corresponding local_queue_work_on() and local_flush_work()
helpers.

On non-RT kernels, every local*_n*() works exactly the same as the non-n
functions (the extra parameter is ignored), and both local_queue_work_on()
and local_flush_work() call their non-local versions.

For RT kernels, though, local_*lock_n*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.

local_queue_work_on() will just call the requested function on the current
cpu: since the local_lock()s are spinlock()s, we are safe.

local_flush_work() then becomes a no-op since no work is actually scheduled
on a remote cpu.
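
The RT behaviour described above can be modelled in user space. The sketch
below is only an illustration, not kernel code: pthread mutexes stand in
for the per-cpu spinlocks, and every name in it (pcp_model, work_model,
model_lock_n(), model_queue_work_on(), model_flush_work(), drain_fn(),
drain_all()) is hypothetical. It assumes the handler reads the target cpu
from the work item, which is how this patch passes it along.

```c
#include <pthread.h>

#define NR_CPUS 4	/* hypothetical cpu count for the model */

/* Stand-in for a per-cpu structure guarded by a local_lock on RT. */
struct pcp_model {
	pthread_mutex_t lock;	/* models the per-cpu spinlock */
	long pending;		/* per-cpu state that needs draining */
};

static struct pcp_model pcp[NR_CPUS] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Stand-in for struct work_struct; 'cpu' models the value kept in ->data. */
struct work_model {
	void (*func)(struct work_model *work);
	int cpu;
};

/* Models local_lock_n()/local_unlock_n(): take the lock of the given cpu. */
static void model_lock_n(int cpu)
{
	pthread_mutex_lock(&pcp[cpu].lock);
}

static void model_unlock_n(int cpu)
{
	pthread_mutex_unlock(&pcp[cpu].lock);
}

/* RT flavour: record the target cpu and run the handler on the caller. */
static void model_queue_work_on(int cpu, struct work_model *work)
{
	work->cpu = cpu;
	work->func(work);
}

/* RT flavour: nothing was queued, so there is nothing to wait for. */
static void model_flush_work(struct work_model *work)
{
	(void)work;
}

/* Handler: updates the target cpu's data while holding that cpu's lock. */
static void drain_fn(struct work_model *work)
{
	model_lock_n(work->cpu);
	pcp[work->cpu].pending = 0;
	model_unlock_n(work->cpu);
}

/* Drains every cpu from the calling thread; returns what is left over. */
static long drain_all(void)
{
	struct work_model work = { .func = drain_fn, .cpu = 0 };
	long left = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		model_queue_work_on(cpu, &work);
		model_flush_work(&work);
	}
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		left += pcp[cpu].pending;
	return left;
}
```

In this model no other thread is ever woken up: the caller pays only for
taking each target's lock, which is the trade-off the RT path relies on.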

Some minimal code rework is needed in order to make this mechanism work:
The calls to local_*lock*() in the functions that are currently scheduled
on remote cpus need to be replaced by local_*lock_n*(), so that in RT
kernels they can reference a different cpu.

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls will become per_cpu_ptr(,smp_processor_id()).
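
As a rough user-space illustration of why that substitution selects the
same data, the sketch below models a per-cpu area as a plain array. The
model_* macros and the current_cpu variable are hypothetical stand-ins for
the kernel's accessors, and the model says nothing about the instructions
the real accessors generate.

```c
#define NR_CPUS 4	/* hypothetical cpu count for the model */

static int current_cpu;		/* models the value of smp_processor_id() */
static long counters[NR_CPUS];	/* models a per-cpu variable */

#define model_smp_processor_id()	(current_cpu)

/* Models per_cpu_ptr(): address of the instance owned by 'cpu'. */
#define model_per_cpu_ptr(arr, cpu)	(&(arr)[(cpu)])

/* Models this_cpu_ptr(): the instance of the cpu we are running on. */
#define model_this_cpu_ptr(arr)					\
	model_per_cpu_ptr(arr, model_smp_processor_id())
```

With current_cpu set to 2, model_this_cpu_ptr(counters) and
model_per_cpu_ptr(counters, 2) yield the same address.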

On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/local_lock.h          | 18 ++++++++++
 include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa7329..f1fa1e8e3fbc 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,22 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+#define local_lock_n(lock, cpu)					\
+	__local_lock_n(lock, cpu)
+
+#define local_unlock_n(lock, cpu)				\
+	__local_unlock_n(lock, cpu)
+
+#define local_lock_irqsave_n(lock, flags, cpu)			\
+	__local_lock_irqsave_n(lock, flags, cpu)
+
+#define local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_irqrestore_n(lock, flags, cpu)
+
+#define local_queue_work_on(cpu, wq, work)			\
+	__local_queue_work_on(cpu, wq, work)
+
+#define local_flush_work(work)					\
+	__local_flush_work(work)
+
 #endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a7..df064149fff8 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -98,6 +98,25 @@ do {								\
 		local_irq_restore(flags);			\
 	} while (0)
 
+#define __local_lock_n(lock, cpu)	__local_lock(lock)
+#define __local_unlock_n(lock, cpu)	__local_unlock(lock)
+
+#define __local_lock_irqsave_n(lock, flags, cpu)		\
+	__local_lock_irqsave(lock, flags)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_irqrestore(lock, flags)
+
+#define __local_queue_work_on(cpu, wq, work)			\
+	do {							\
+		typeof(cpu) __cpu = cpu;			\
+		typeof(work) __work = work;			\
+		__work->data.counter = __cpu;			\
+		queue_work_on(__cpu, wq, __work);		\
+	} while (0)
+
+#define __local_flush_work(work)	flush_work(work)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -138,4 +157,37 @@ typedef spinlock_t local_lock_t;
 
 #define __local_unlock_irqrestore(lock, flags)	__local_unlock(lock)
 
+#define __local_lock_n(__lock, cpu)				\
+	do {							\
+		migrate_disable();				\
+		spin_lock(per_cpu_ptr((__lock), cpu));		\
+	} while (0)
+
+#define __local_unlock_n(__lock, cpu)				\
+	do {							\
+		spin_unlock(per_cpu_ptr((__lock), cpu));	\
+		migrate_enable();				\
+	} while (0)
+
+#define __local_lock_irqsave_n(lock, flags, cpu)		\
+	do {							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		__local_lock_n(lock, cpu);			\
+	} while (0)
+
+#define __local_unlock_irqrestore_n(lock, flags, cpu)		\
+	__local_unlock_n(lock, cpu)
+
+#define __local_queue_work_on(cpu, wq, work)			\
+	do {							\
+		typeof(cpu) __cpu = cpu;			\
+		typeof(work) __work = work;			\
+		__work->data.counter = __cpu;			\
+		__work->func(__work);				\
+	} while (0)
+
+#define __local_flush_work(work)				\
+	do {} while (0)
+
 #endif /* CONFIG_PREEMPT_RT */
-- 
2.41.0