2023-10-26 03:54:42

by Steven Rostedt

Subject: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

From: "Steven Rostedt (Google)" <[email protected]>

This shows very good performance improvements on spin locks implemented in
user space, and I'm sure it can be used for spin locks in VMs too. That will
come shortly.

I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

So you need to select:

CONFIG_PREEMPT_AUTO

Below is my proof-of-concept patch. It still has debugging in it, and
although I now use rseq, I likely used it incorrectly and it will need to be
changed. It's also missing necessary comments. But this is still just a POC.

I added a "cr_flags" to the end of the struct rseq. OK, the name sucks, I
just thought of "critical" and had to pick something. Let's bikeshed
that later. Now, that the bits live in the rseq structure, there's no need
to open up any files. Although, you need to run the test with:

GLIBC_TUNABLES=glibc.pthread.rseq=0 ./extend-sched

That prevents glibc from registering its own rseq area, so you can use the
new extension.

Now my extend() and unextend() look like this:


static void extend(void)
{
	rseq_map.cr_flags = 1;
}

static void unextend(void)
{
	unsigned long prev;

	prev = xchg(&rseq_map.cr_flags, 0);
	if (prev & 2) {
		tracefs_printf(NULL, "Yield!\n");
		sched_yield();
	}
}

Note that any system call will do. sched_yield() is just something that
"makes sense", but it could also be gettid(), which is probably even more
efficient as the schedule will happen on the way back to user space,
because the bit is now cleared but NEED_RESCHED_LAZY is still set.

The magic will be how we get the rseq_map to play with. There are various
ways to do that, but that's an exercise for later.
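
(For illustration, one way to get at the rseq area is the approach used later
in this thread for the PostgreSQL patch: resolve the area glibc registers via
__rseq_offset. A minimal sketch, assuming glibc >= 2.35 and that the
registered area is large enough to carry the new field:)

#include <sys/rseq.h>	/* struct rseq, __rseq_offset, __rseq_size */

static struct rseq *get_rseq_area(void)
{
	if (!__rseq_size)
		return NULL;	/* no rseq area registered for this thread */

	/* glibc keeps the rseq area at a fixed offset from the thread pointer */
	return (struct rseq *)((unsigned long)__builtin_thread_pointer() +
			       __rseq_offset);
}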

So, bit 0 is for user space to tell the kernel "please extend me", and bit
1 is for the kernel to tell user space "OK, I extended you, but call
sched_yield() (or any system call) when done".

The attached test program creates (1 + number of CPUs) threads that run in a
loop for 5 seconds. Each thread will grab a user space spin lock (not a
futex, just shared memory). Before grabbing the lock it calls "extend()"; if
it fails to grab the lock, it calls "unextend()" and spins on the lock until
it's free, then tries again. After it gets the lock, it updates a counter and
releases the lock, calling "unextend()" as well. Then it spins on the counter
until it increments again, to allow another task to get into the critical
section.
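
(For reference, a rough sketch of the per-thread loop described above; the
names and helpers are illustrative, not the exact code from the attached
extend-sched.c:)

static volatile unsigned int lock;	/* shared-memory spin lock */
static volatile unsigned long counter;	/* protected by the lock */
static volatile int done;		/* set elsewhere after 5 seconds */

static void *thread_loop(void *arg)
{
	while (!done) {
		unsigned long seen;

		extend();				/* ask for an extended slice */
		while (__sync_lock_test_and_set(&lock, 1)) {
			unextend();			/* lost the race: drop the request */
			while (lock && !done)		/* spin outside the critical section */
				;
			extend();			/* and try again */
		}

		seen = ++counter;			/* the critical section */

		__sync_lock_release(&lock);
		unextend();				/* may sched_yield() if the kernel asked */

		while (counter == seen && !done)	/* wait for another task to get in */
			;
	}
	return NULL;
}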

With the init of the extend_map disabled, so that it doesn't use the extend
code, it ends with:

Ran for 3908165 times
Total wait time: 33.965654

I can give you stdev and all that too, but the above is pretty much the
same after several runs.

After enabling the extend code, it has:

Ran for 4829340 times
Total wait time: 32.635407

It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!

The total wait time for getting into the critical section also dropped by
over a second (a 4% improvement).

I ran a traceeval tool on it (still a work in progress, but I can post it
when it's done) over the following trace, along with the writes to
trace_marker (tracefs_printf):

trace-cmd record -e sched_switch ./extend-sched

It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!

Note, I tried replacing the user space spin lock with a futex, and it
dropped performance so much, both with and without the update, that the
benefit is in the noise.

Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch):

Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file. To build it with libtracefs:

gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.

But either way, to run it you need to export

GLIBC_TUNABLES=glibc.pthread.rseq=0

Otherwise it will fail to register the rseq structure because glibc has
already done that, but the glibc version doesn't include the extended size.
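
(For completeness, a minimal sketch of what registering your own extended
rseq area looks like; this mirrors the init_extend_map() shown later in this
thread, and assumes a uapi header patched to include the new cr_flags field.
glibc has no rseq() wrapper, so the raw syscall is used:)

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>		/* patched struct rseq with cr_flags */

static struct rseq rseq_map;	/* struct rseq is already 32-byte aligned */

static int register_rseq(void)
{
	/* sig is 0 here only because this sketch does not use rseq_cs
	 * critical sections; real rseq users pass their RSEQ_SIG. */
	return syscall(SYS_rseq, &rseq_map, sizeof(rseq_map), 0, 0);
}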

Signed-off-by: Steven Rostedt (Google) <[email protected]>
---
Changes since v1: https://lore.kernel.org/all/[email protected]/

- Use rseq as the interface (Peter Zijlstra)

(This patch is getting smaller and smaller!)

include/uapi/linux/rseq.h | 14 ++++++++++++++
kernel/entry/common.c | 17 ++++++++++++++++-
kernel/rseq.c | 27 +++++++++++++++++++++++++++
kernel/sched/fair.c | 5 +++--
4 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..bd3aa4085e7b 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,18 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
};

+enum rseq_cr_flags_bit {
+ RSEQ_CR_FLAG_IN_CRITICAL_SECTION_BIT = 0,
+ RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT = 1,
+};
+
+enum rseq_cr_flags {
+ RSEQ_CR_FLAG_IN_CRITICAL_SECTION =
+ (1U << RSEQ_CR_FLAG_IN_CRITICAL_SECTION_BIT),
+ RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED =
+ (1U << RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT),
+};
+
/*
* struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line. It is usually declared as
@@ -148,6 +160,8 @@ struct rseq {
*/
__u32 mm_cid;

+ __u32 cr_flags;
+
/*
* Flexible array member at end of structure, after last feature field.
*/
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..d8b46b9e5fd7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -143,21 +143,35 @@ void noinstr exit_to_user_mode(void)

/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
+bool rseq_ignore_lazy_resched(void);

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
+ unsigned long ignore_mask;
+
/*
* Before returning to user space ensure that all pending work
* items have been completed.
*/
while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ ignore_mask = 0;

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+ if (ti_work & _TIF_NEED_RESCHED) {
schedule();

+ } else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+ if (rseq_ignore_lazy_resched()) {
+ trace_printk("Extend!\n");
+ /* Allow to leave with NEED_RESCHED_LAZY still set */
+ ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+ } else {
+ schedule();
+ }
+ }
+
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);

@@ -184,6 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
tick_nohz_user_enter_prepare();

ti_work = read_thread_flags();
+ ti_work &= ~ignore_mask;
}

/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..fd9d18f60c04 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -339,6 +339,33 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
force_sigsegv(sig);
}

+bool rseq_ignore_lazy_resched(void)
+{
+ struct task_struct *t = current;
+ u32 flags;
+
+ if (!t->rseq)
+ return false;
+
+ /* Make sure the cr_flags exist */
+ if (t->rseq_len <= offsetof(struct rseq, cr_flags))
+ return false;
+
+ if (copy_from_user(&flags, &t->rseq->cr_flags, sizeof(flags)))
+ return false;
+
+ if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION))
+ return false;
+
+ flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED;
+
+ /* If we fault writing, then do not give it an extended slice */
+ if (copy_to_user(&t->rseq->cr_flags, &flags, sizeof(flags)))
+ return false;
+
+ return true;
+}
+
#ifdef CONFIG_DEBUG_RSEQ

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
resched_curr(rq);
} else {
/* Did the task ignore the lazy reschedule request? */
- if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+ if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+ trace_printk("Force resched?\n");
resched_curr(rq);
- else
+ } else
resched_curr_lazy(rq);
}
clear_buddies(cfs_rq, se);
--
2.42.0


Attachments:
extend-sched.c (5.49 kB)

2023-10-26 11:01:02

by Peter Zijlstra

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:

> static void extend(void)
> {
> rseq_map.cr_flags = 1;
> }
>
> static void unextend(void)
> {
> unsigned long prev;
>
> prev = xchg(&rseq_map.cr_flags, 0);

So you complain about overhead and then you add one of the most
expensive ops possible here? xchg has an implicit LOCK prefix and you
really don't need LOCK prefix here.

> if (prev & 2) {
> tracefs_printf(NULL, "Yield!\n");
> sched_yield();
> }
> }

2023-10-26 11:14:33

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 12:59:44 +0200
Peter Zijlstra <[email protected]> wrote:

> On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:
>
> > static void extend(void)
> > {
> > rseq_map.cr_flags = 1;
> > }
> >
> > static void unextend(void)
> > {
> > unsigned long prev;
> >
> > prev = xchg(&rseq_map.cr_flags, 0);
>
> So you complain about overhead and then you add one of the most
> expensive ops possible here? xchg has an implicit LOCK prefix and you
> really don't need LOCK prefix here.

Peter, this is the user space side, where I cut and pasted the code from
the file I attached.

That has:

static inline unsigned long
xchg(volatile unsigned *ptr, unsigned new)
{
	unsigned ret = new;

	asm volatile("xchg %b0,%1"
		     : "+r"(ret), "+m"(*(ptr))
		     : : "memory");
	return ret;
}

-- Steve


>
> > if (prev & 2) {
> > tracefs_printf(NULL, "Yield!\n");
> > sched_yield();
> > }
> > }

2023-10-26 18:36:39

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-26 07:14, Steven Rostedt wrote:
> On Thu, 26 Oct 2023 12:59:44 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:
>>
>>> static void extend(void)
>>> {
>>> rseq_map.cr_flags = 1;
>>> }
>>>
>>> static void unextend(void)
>>> {
>>> unsigned long prev;
>>>
>>> prev = xchg(&rseq_map.cr_flags, 0);
>>
>> So you complain about overhead and then you add one of the most
>> expensive ops possible here? xchg has an implicit LOCK prefix and you
>> really don't need LOCK prefix here.
>
> Peter, this is the user space side, where I cut and pasted the code from
> the file I attached.
>
> That has:
>
> static inline unsigned long
> xchg(volatile unsigned *ptr, unsigned new)
> {
> unsigned ret = new;
>
> asm volatile("xchg %b0,%1"

which has an implicit lock prefix (xchg with a memory operand is a
special-case):

Quoting Intel manual:

"If a memory operand is referenced, the processor’s locking protocol is
automatically implemented for the duration of the exchange operation,
regardless of the presence or absence of the LOCK prefix or of the value
of the IOPL. (See the LOCK prefix description in this chapter for more
information on the locking protocol.)"

Thanks,

Mathieu


> : "+r"(ret), "+m"(*(ptr))
> : : "memory");
> return ret;
> }
>
> -- Steve
>
>
>>
>>> if (prev & 2) {
>>> tracefs_printf(NULL, "Yield!\n");
>>> sched_yield();
>>> }
>>> }
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-26 18:58:18

by Linus Torvalds

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 at 08:36, Mathieu Desnoyers
<[email protected]> wrote:
>
> > asm volatile("xchg %b0,%1"
>
> which has an implicit lock prefix (xchg with a memory operand is a
> special-case):

Yeah, this is why we do "percpu_xchg()" - which does not want locked
semantics - as a "cmpxchg" loop.

Steven, check out

arch/x86/include/asm/percpu.h

for a rough implementation of a 'xchg()' without SMP coherency, just
cpu-local one (ie atomic wrt being preempted by the kernel, but not
atomic wrt other CPU's accessing the same variable concurrently)

Linus

2023-10-26 18:59:20

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-26 14:50, Linus Torvalds wrote:
> On Thu, 26 Oct 2023 at 08:36, Mathieu Desnoyers
> <[email protected]> wrote:
>>
>>> asm volatile("xchg %b0,%1"
>>
>> which has an implicit lock prefix (xchg with a memory operand is a
>> special-case):
>
> Yeah, this is why we do "percpu_xchg()" - which does not want locked
> semantics - as a "cmpxchg" loop.
>
> Steven, check out
>
> arch/x86/include/asm/percpu.h
>
> for a rough implementation of a 'xchg()' without SMP coherency, just
> cpu-local one (ie atomic wrt being preempted by the kernel, but not
> atomic wrt other CPU's accessing the same variable concurrently)

Actually Steven does not need a xchg to test-and-set a single bit which
is only accessed concurrently between kernel and userspace from the same
thread. Either "bts" or "andb" should work fine.

Thanks,

Mathieu

>
> Linus

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-26 19:21:39

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 14:36:36 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > static inline unsigned long
> > xchg(volatile unsigned *ptr, unsigned new)
> > {
> > unsigned ret = new;
> >
> > asm volatile("xchg %b0,%1"
>
> which has an implicit lock prefix (xchg with a memory operand is a
> special-case):
>
> Quoting Intel manual:
>
> "If a memory operand is referenced, the processor’s locking protocol is
> automatically implemented for the duration of the exchange operation,
> regardless of the presence or absence of the LOCK prefix or of the value
> of the IOPL. (See the LOCK prefix description in this chapter for more
> information on the locking protocol.)"
>

Bah. I learn something new every day :-p

I thought xchg required specifying lock too. This means that a plain
(non-LOCK) cmpxchg is actually faster than xchg! It's been a long time since
I last wrote assembly.

What makes this interesting, though, is that when I ran my tests originally,
the xchg() code never executed when my change was disabled, and the enabled
case still showed a significant improvement. That is, even with the added
xchg(), it still improved by a lot. But that's probably because the xchg()
happens after releasing the lock, and after that it goes into a loop waiting
for another thread to make the update, which requires a locked cmpxchg().
Hence, the xchg() slowdown wasn't in a critical path of the test.

Anyway, I changed the code to use:

static inline unsigned clrbit(volatile unsigned *ptr)
{
	unsigned ret;

	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)ptr)
		     : "iq" (0x2)
		     : "memory");

	ret = *ptr;
	*ptr = 0;

	return ret;
}

I just used andb as that's a locally atomic operation. I could also use bts
(as Mathieu suggested to me on IRC). It doesn't matter, as this is just
example code and is not in production. How this is implemented is not an
important part of the algorithm; what matters is just that it can atomically
clear the bit it sets without racing with the kernel setting its bit. I could
modify the code to put these bits in separate bytes as well. That's just an
implementation detail we can work on later.

I updated my test to have:

static bool no_rseq;

static void init_extend_map(void)
{
	int ret;

	if (no_rseq)
		return;

	ret = rseq(&rseq_map, sizeof(rseq_map), 0, 0);
	perror("rseq");
	printf("ret = %d (%zd) %p\n", ret, sizeof(rseq_map), &rseq_map);
}

static inline unsigned clrbit(volatile unsigned *ptr)
{
	unsigned ret;

	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)ptr)
		     : "iq" (0x2)
		     : "memory");

	ret = *ptr;
	*ptr = 0;

	return ret;
}

static void extend(void)
{
	if (no_rseq)
		return;

	rseq_map.cr_flags = 1;
}

static void unextend(void)
{
	unsigned prev;

	if (no_rseq)
		return;

	prev = clrbit(&rseq_map.cr_flags);
	if (prev & 2) {
		tracefs_printf(NULL, "Yield!\n");
		sched_yield();
	}
}

where the tests with it disabled do not run the extend() or unextend() code
(it only makes sense to perform that code when this feature is used, as it
is part of the overhead of implementing it).

Here's the results:

With no_rseq = true, over 5 runs, which were basically all the same, the
best run was:

Ran for 4260797 times
Total wait time: 33.905306

And with no_rseq = false;

Most runs were like:

Ran for 5378845 times
Total wait time: 32.253018

But the worst run was:

Ran for 4991411 times
Total wait time: 31.745662

But with that lower "wait time" it was probably preempted by something
else during that run, since normally the lower the "wait time" the better
the result.

-- Steve

2023-10-26 19:37:09

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 14:59:13 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > Steven, check out
> >
> > arch/x86/include/asm/percpu.h
> >
> > for a rough implementation of a 'xchg()' without SMP coherency, just
> > cpu-local one (ie atomic wrt being preempted by the kernel, but not
> > atomic wrt other CPU's accessing the same variable concurrently)
>
> Actually Steven does not need a xchg to test-and-set a single bit which
> is only accessed concurrently between kernel and userspace from the same
> thread. Either "bts" or "andb" should work fine.

Thanks Linus and Mathieu!

Yeah, I just changed it to andb and that works just as well (or better).

When I first made the change, it became substantially worse. But then I
realized it was because I had "addb" and not "andb" :-D

-- Steve

2023-10-26 20:46:44

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 14:59:13 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > for a rough implementation of a 'xchg()' without SMP coherency, just
> > cpu-local one (ie atomic wrt being preempted by the kernel, but not
> > atomic wrt other CPU's accessing the same variable concurrently)
>
> Actually Steven does not need a xchg to test-and-set a single bit which
> is only accessed concurrently between kernel and userspace from the same
> thread. Either "bts" or "andb" should work fine.

Hmm, how would bts work? Doesn't that just set a bit?

I need to clear one bit while seeing if another bit is set. I could also
use subl, as that would also atomically clear the bit.
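
(For illustration, a sketch of what that subl variant could look like; it
assumes unextend() only runs after extend() has set bit 0, otherwise the
subtract would borrow into the upper bits:)

static inline unsigned clrbit_sub(volatile unsigned *ptr)
{
	unsigned ret;

	/* A single read-modify-write instruction, so it cannot be split by
	 * preemption on this CPU; it clears bit 0 that extend() set. */
	asm volatile("subl %1,%0"
		     : "+m" (*ptr)
		     : "ir" (0x1)
		     : "memory");

	/* With bit 0 clear the kernel will no longer set the "please yield"
	 * bit, so a plain read and clear is race-free from here on. */
	ret = *ptr;
	*ptr = 0;

	return ret;
}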

-- Steve

2023-10-26 21:36:05

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Thu, 26 Oct 2023 15:20:22 -0400
Steven Rostedt <[email protected]> wrote:

> Anyway, I changed the code to use:
>
> static inline unsigned clrbit(volatile unsigned *ptr)
> {
> unsigned ret;
>
> asm volatile("andb %b1,%0"
> : "+m" (*(volatile char *)ptr)
> : "iq" (0x2)
> : "memory");
>
> ret = *ptr;
> *ptr = 0;
>
> return ret;
> }

Mathieu also told me that glibc's rseq has some extra padding at the end,
that happens to be big enough to hold this feature. That means you can run
the code without adding:

GLIBC_TUNABLES=glibc.pthread.rseq=0

Attached is the updated test program.

-- Steve


Attachments:
extend-sched.c (6.05 kB)

2023-10-27 16:10:01

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-26 16:45, Steven Rostedt wrote:
> On Thu, 26 Oct 2023 14:59:13 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
>>> for a rough implementation of a 'xchg()' without SMP coherency, just
>>> cpu-local one (ie atomic wrt being preempted by the kernel, but not
>>> atomic wrt other CPU's accessing the same variable concurrently)
>>
>> Actually Steven does not need a xchg to test-and-set a single bit which
>> is only accessed concurrently between kernel and userspace from the same
>> thread. Either "bts" or "andb" should work fine.
>
> Hmm, how would bts work? Doesn't that just set a bit?

Yes, and it saves the bit state in the CF flag before setting it.

If you need to clear a bit then it would be "btr".

>
> I need to clear one bit while seeing if another bit is set. I could also
> use subl, as that would also atomically clear the bit.

Ah ok, I did not get that you needed to test for a different bit than
the one you clear.

Thanks,

Mathieu

>
> -- Steve

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-27 16:21:45

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-26 17:35, Steven Rostedt wrote:
> On Thu, 26 Oct 2023 15:20:22 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> Anyway, I changed the code to use:
>>
>> static inline unsigned clrbit(volatile unsigned *ptr)
>> {
>> unsigned ret;
>>
>> asm volatile("andb %b1,%0"
>> : "+m" (*(volatile char *)ptr)
>> : "iq" (0x2)
>> : "memory");
>>
>> ret = *ptr;
>> *ptr = 0;
>>
>> return ret;
>> }
>
> Mathieu also told me that glibc's rseq has some extra padding at the end,
> that happens to be big enough to hold this feature. That means you can run
> the code without adding:
>
> GLIBC_TUNABLES=glibc.pthread.rseq=0
>
> Attached is the updated test program.

I think you'll want to modify the semantics of your "cr_flags" field so
it supports nested locks as well. You can turn cr_flags into a
nesting counter. The "yield" bit could be one of the bits of that
word (e.g. the lowest bit).

Therefore extend() becomes an add of 0x2, unextend() becomes a sub of 0x2,
and you can check the lowest bit for the yield hint.
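
(A sketch of that scheme, assuming bit 0 is the kernel's yield hint and bits
1..31 hold the nesting count; the updates are plain C here just to show the
protocol, a real implementation would need the same single-instruction care
as the andb above:)

#include <sched.h>

static __thread volatile unsigned *cr_flags;	/* points into this thread's rseq area */

static void extend(void)
{
	*cr_flags += 0x2;			/* enter one nesting level */
}

static void unextend(void)
{
	unsigned prev = *cr_flags;

	*cr_flags = prev - 0x2;			/* leave one nesting level */
	if (prev == 0x3) {			/* outermost level and the kernel set the yield bit */
		*cr_flags = 0;
		sched_yield();
	}
}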

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-27 16:26:14

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Fri, 27 Oct 2023 12:09:56 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > I need to clear one bit while seeing if another bit is set. I could also
> > use subl, as that would also atomically clear the bit.
>
> Ah ok, I did not get that you needed to test for a different bit than
> the one you clear.

Yeah, maybe I'm not articulating the implementation as well as I could.

bit 0: Set by user space to tell the kernel it's in a critical section

bit 1: Set by the kernel to say it gave user space an extended time slice

Bit 1 will only be set by the kernel if bit 0 is set.

When entering a critical section, user space will set bit 0. When it leaves
the critical section, it needs to clear bit 0, but also needs to handle the
race condition from where it clears the bit and where the kernel could
preempt it and set bit 1. Thus it needs an atomic operation to clear bit 0
without affecting bit 1. Once bit 0 is cleared, it does not need to worry
about bit 1 being set after that as the kernel will only set bit 1 if it
sees that bit 0 was set. After user space clears bit 0, it must check bit 1
to see if it should now schedule. And it's also up to user space to clear
bit 1, but it can do that at any time with bit 0 cleared.

extend() {
	cr_flags = 1;
}

unextend() {
	cr_flags &= ~1; /* Must be atomic */
	if (cr_flags & 2) {
		cr_flags = 0; /* May not be necessary as it gets cleared by extend() */
		sched_yield();
	}
}

Does that make more sense?

-- Steve

2023-10-27 16:36:06

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-27 12:24, Steven Rostedt wrote:
> On Fri, 27 Oct 2023 12:09:56 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
>>> I need to clear one bit while seeing if another bit is set. I could also
>>> use subl, as that would also atomically clear the bit.
>>
>> Ah ok, I did not get that you needed to test for a different bit than
>> the one you clear.
>
> Yeah, maybe I'm not articulating the implementation as well.
>
> bit 0: Set by user space to tell the kernel it's in a critical section
>
> bit 1: Set by kernel that it gave user space extend time slice
>
> Bit 1 will only be set by the kernel if bit 0 is set.
>
> When entering a critical section, user space will set bit 0. When it leaves
> the critical section, it needs to clear bit 0, but also needs to handle the
> race condition from where it clears the bit and where the kernel could
> preempt it and set bit 1. Thus it needs an atomic operation to clear bit 0
> without affecting bit 1. Once bit 0 is cleared, it does not need to worry
> about bit 1 being set after that as the kernel will only set bit 1 if it
> sees that bit 0 was set. After user space clears bit 0, it must check bit 1
> to see if it should now schedule. And it's also up to user space to clear
> bit 1, but it can do that at any time with bit 0 cleared.
>
> extend() {
> cr_flags = 1;
> }
>
> unextend() {
> cr_flags &= ~1; /* Must be atomic */
> if (cr_flags & 2) {
> cr_flags = 0; /* May not be necessary as it gets cleared by extend() */
> sched_yield();
> }
> }
>
> Does that make more sense?

Not really.

Please see my other email about the need for a reference count here, for
nested locks use-cases.

By "atomic" operation I suspect you only mean "single instruction" which can
alter the state of the field and keep its prior content in a register, not a
lock-prefixed atomic operation, right ?

The only reason you have this asm trickiness is because both states
are placed into different bits of the same word, which is just an
optimization. You could achieve the same much more simply by splitting
this state into two different words, e.g.:

extend() {
	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest + 1);
	barrier()
}

unextend() {
	barrier()
	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest - 1);
	if (READ_ONCE(__rseq_abi->must_yield)) {
		WRITE_ONCE(__rseq_abi->must_yield, 0);
		sched_yield();
	}
}

Or am I missing something ?

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-27 16:44:36

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Fri, 27 Oct 2023 12:21:45 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > Attached is the updated test program.
>
> I think you'll want to modify the semantic of your "cr_flags" field so
> it supports nested locks as well. You can change this cr_flags for a
> nesting counter. The "yield" bit could be one of the bits of that
> counter (e.g. lowest bit).

Hmm, yeah, we could just have it be: bit 0 set by the kernel, and allow
user space to set any bit above that to say "I'm in a critical section".
Then the kernel would do:

	if (cr_flags & ~1) // in critical section

and this would allow user space to use cr_flags as the nesting counter if it
wants to. I just don't want that to be a kernel requirement, but allowing
*any* bit above bit 0 to be set would let user space decide that for
itself.

>
> Therefore extend() become add 0x2, and unextend() become a sub 0x2, and
> you can check the lowest bit for yield hint.

Right, that makes sense. I'll update that in the next version.

Thanks!

-- Steve

2023-10-27 16:50:01

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Fri, 27 Oct 2023 12:35:56 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > Does that make more sense?
>
> Not really.
>
> Please see my other email about the need for a reference count here, for
> nested locks use-cases.

Note, my original implementation of nested locking was done completely in
user space.

int __thread lock_cnt;

extend() {
	if (lock_cnt++)
		return;
	...
}

unextend() {
	if (--lock_cnt)
		return;
	...
}

>
> By "atomic" operation I suspect you only mean "single instruction" which can
> alter the state of the field and keep its prior content in a register, not a
> lock-prefixed atomic operation, right ?

Correct. Just a per-CPU atomic. Hence an "andb" instruction, or a "subl",
or whatever.

>
> The only reason why you have this asm trickiness is because both states
> are placed into different bits from the same word, which is just an
> optimization. You could achieve the same much more simply by splitting
> this state in two different words, e.g.:
>
> extend() {
> WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest + 1);
> barrier()
> }
>
> unextend() {
> barrier()
> WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest - 1);
> if (READ_ONCE(__rseq_abi->must_yield)) {
> WRITE_ONCE(__rseq_abi->must_yield, 0);
> sched_yield();
> }
> }
>
> Or am I missing something ?

I mentioned placing these in different bytes, although I meant words,
but yeah, if we make them separate it would make things easier. But me being
frugal about memory, if this is just two bits (or even a counter with an
extra bit) I didn't want to waste two words for what can be done with
one.

But this is still an implementation detail, and this code is still very
much in flux, and I'm not as worried about those details yet.

-- Steve

2023-10-27 21:52:49

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

As this change showed a really good improvement in my micro-benchmark, I
decided to give it a try on something that's a bit more of a real world
benchmark: PostgreSQL.

Hence, I'm also Cc'ing the postgres maintainers in case they are interested.
To bring them up to speed, here's what's been going on:

A new feature is being discussed that adds NEED_RESCHED_LAZY, which will
allow us (the kernel developers) to remove CONFIG_PREEMPT_NONE and
CONFIG_PREEMPT_VOLUNTARY, and move that choice to runtime switches set from
user space. The idea is that when the scheduler tick goes off and it's time
to schedule out a SCHED_OTHER task, the kernel can have the option of not
doing that immediately, and instead waiting until the task heads back into
user space before scheduling (simulating PREEMPT_NONE). The scheduler tick
will set NEED_RESCHED_LAZY (instead of NEED_RESCHED, which always schedules
when possible), and that bit will only be looked at when exiting back into
user space, where it will perform the schedule if it is set.

My idea is to extend this into user space as well. Using the restartable
sequences infrastructure (https://lwn.net/Articles/697979/) that maps memory
between kernel and user space for threads, I'll use two bits (or one bit and a
counter, but that's for later, I'll just discuss the current implementation).

bit 0: is set by user space to tell the kernel that it's in a critical
section.

bit 1: is set by the kernel telling user space that it granted it a bit more
time and that it should call back into the kernel with any system call
(sched_yield() or gettid()), when it is out of its critical section.

Bit 1 will never be set if bit 0 is not set (note, there's talk about making
bit 0 the one set by the kernel, or using a different word entirely, to allow
the rest of the bits to be used as a counter for nested critical sections).

Now when returning back to user space, if the critical section bit (or
counter) is set, then it will not call schedule when NEED_RESCHED_LAZY is set.
Note that it will still always call schedule on NEED_RESCHED. This gives user
space one more tick (1 ms with 1000 HZ kernel config, to 4 ms with 250 HZ
kernel config). When user space is done with its critical section, it should
check the bit that can be set by the kernel to see if it should then schedule.

If the user side bit is not cleared after a tick, then the kernel will set
NEED_RESCHED, which will force a schedule no matter where user space happens
to be. Note, this could also hurt that task, in that the scheduler will take
away the task's eligibility to balance out the amount of extra time the task
ran for, not to mention that the forced schedule could now land in a critical
section.

Back in 2014 at the Linux Collaboration Summit in Napa Valley I had a nice
conversation with Robert Haas about user space spin locks. He told me how they
are used in PostgreSQL where futexes did not meet their needs. This
conversation kicked off the idea about implementing user space adaptive spin
locks (which last year in Belgium, I asked André Almeida to implement -
https://lwn.net/Articles/931789/).

Even though user space adaptive spinning would greatly help out the contention
of a lock, there's still the issue of a lock owner being preempted which would
cause all those waiters to also go into the kernel and delay access to the
critical sections. In the real time kernel, this was solved by
NEED_RESCHED_LAZY:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/preempt-lazy-support.patch?h=v5.9-rc2-rt1-patches

Now Thomas has proposed using a similar solution to solve the PREEMPT_NONE /
VOLUNTARY issue.

https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

Which now has a prototype in the rt-devel tree:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

I applied that to v6.6-rc4 (my current working branch), along with the
solution I explained above (a POC with debugging still in it):

https://lore.kernel.org/all/[email protected]/

Now I downloaded the latest postgres from:

https://github.com/postgres/postgres.git

And built sha: 26f988212eada9c586223cbbf876c7eb455044d9

After installing it, I rebooted the machine with the updated kernel (which
requires CONFIG_PREEMPT_AUTO to be set), and ran the pgbench test against the
unmodified postgres build:

pgbench -c 100 -T 300 -j 8 -S -n

I ran it 16 times and looked at the transactions per second counter (tps).
Note, I threw out the first run as it had horrible numbers probably due to
everything in cold cache (memory and file system).

Then I applied the below patch, did a make clean, make install, rebooted the
box again and ran the test for another 16 times (again, the first run was
horrible).

Here are the results of the tests: I only used the 15 runs after the first run
for comparisons.

Without the patched postgres executable:

First run:
tps = 72573.188203 (without initial connection time)

15 runs:
tps = 74315.731978 (without initial connection time)
tps = 74448.130108 (without initial connection time)
tps = 74662.246100 (without initial connection time)
tps = 73124.961311 (without initial connection time)
tps = 74653.611878 (without initial connection time)
tps = 74765.296134 (without initial connection time)
tps = 74497.066104 (without initial connection time)
tps = 74541.664031 (without initial connection time)
tps = 74595.032066 (without initial connection time)
tps = 74545.876793 (without initial connection time)
tps = 74762.560651 (without initial connection time)
tps = 74528.657018 (without initial connection time)
tps = 74814.700753 (without initial connection time)
tps = 74687.980967 (without initial connection time)
tps = 74973.185122 (without initial connection time)

With the patched postgres executable:

First run:
tps = 73562.005970 (without initial connection time)

15 runs:
tps = 74560.101322 (without initial connection time)
tps = 74711.177071 (without initial connection time)
tps = 74551.093281 (without initial connection time)
tps = 74559.452628 (without initial connection time)
tps = 74737.604361 (without initial connection time)
tps = 74521.606019 (without initial connection time)
tps = 74870.859166 (without initial connection time)
tps = 74545.423471 (without initial connection time)
tps = 74805.939815 (without initial connection time)
tps = 74665.240730 (without initial connection time)
tps = 74701.479550 (without initial connection time)
tps = 74897.154079 (without initial connection time)
tps = 74879.687067 (without initial connection time)
tps = 74792.563116 (without initial connection time)
tps = 74852.101317 (without initial connection time)

Without the patch:

Average: 74527.7800
Std Dev: 420.6304

With the patch:
Average: 74710.0988
Std Dev: 136.7250

Notes about my setup. I ran this on one of my older test boxes (pretty much the
last of the bare metal machines I test on, as now I do most on VMs, but did not
want to run these tests on VMs).

It's 4 cores (2-way hyperthreaded), for a total of 8 CPUs:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

32G of RAM.

Postgres has several types of locks. I first applied the extend()/unextend()
to only the raw spin locks, but that didn't make much difference. When I
traced it, I found that the end of the time slice seldom landed while one of
these spin locks was held: 10 or so times during a 5 minute run (I added
writes to the kernel tracing buffer via trace_marker to know when the locks
were held, and also recorded sched_switch to see if the holder was ever
preempted). So I expanded the usage to the "lightweight locks" (lwlocks),
which are similar to the spin locks but do some complex backing off;
basically a heuristic spin. Anyway, after adding the logic to these locks, it
definitely made a difference. Not a huge one, but it was noticeable beyond
the noise. I can imagine that if this were implemented on a machine with many
more CPUs than 8, it would make an even bigger difference.

I also had to rerun my tests because I had left some kernel config options
enabled that affected performance, and I didn't want that to skew the
results. The results were similar, except that with the slower kernel, the
worst performance with the patch was better than the best performance without
it. Not by much, but still. After removing the slowdown, that was no longer
the case. But I noticed that with the patch, the standard deviation was much
smaller than without the patch. I'm guessing that without the patch it
depends much more on how the scheduler interacts with the locking, and a good
run without the patch was just the test being "lucky" that it didn't get
preempted as much in a critical section.

I personally think the smaller standard deviation is a win as it makes the
database run with a more deterministic behavior.

Anyway, I'd really like to know what others think about this, and perhaps they
can run this on their own testing infrastructure. All the code is available to
reproduce it. If you want to reproduce it like I did: check out the latest
Linus tree (or do what I did, which is v6.6-rc4), apply the above-mentioned
PREEMPT_AUTO.patch, then apply my kernel patch. Select CONFIG_PREEMPT_AUTO and
build the kernel. Don't worry about the big banner on the kernel console at
boot up telling you that you are running a debug kernel. You are running one,
and it's because I still have a couple of trace_printk()s in it.

Run your postgres tests with and without the patch and let me know if there's a
difference. Only if you think it's worth it. Let me know if you don't ;-)

-- Steve

[ The biggest part of the below change is adding in the standard rseq_abi.h header ]


diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..652d3a5560 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -89,11 +89,12 @@
#include "storage/spin.h"
#include "utils/memutils.h"

+#include "storage/rseq-abi.h"
+
#ifdef LWLOCK_STATS
#include "utils/hsearch.h"
#endif

-
/* We use the ShmemLock spinlock to protect LWLockCounter */
extern slock_t *ShmemLock;

@@ -841,6 +842,8 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
desired_state += LW_VAL_SHARED;
}

+ if (lock_free)
+ extend();
/*
* Attempt to swap in the state we are expecting. If we didn't see
* lock to be free, that's just the old value. If we saw it as free,
@@ -863,9 +866,14 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
#endif
return false;
}
- else
+ else {
+ if (lock_free)
+ unextend();
return true; /* somebody else has the lock */
+ }
}
+ if (lock_free)
+ unextend();
}
pg_unreachable();
}
@@ -1868,6 +1876,7 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}

+ unextend();
/*
* Now okay to allow cancel/die interrupts.
*/
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c..c22310cfe3 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -55,6 +55,8 @@
#include "storage/s_lock.h"
#include "utils/wait_event.h"

+#include "storage/rseq-abi.h"
+
#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000
#define NUM_DELAYS 1000
@@ -66,7 +68,6 @@ slock_t dummy_spinlock;

static int spins_per_delay = DEFAULT_SPINS_PER_DELAY;

-
/*
* s_lock_stuck() - complain about a stuck spinlock
*/
@@ -94,6 +95,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
{
SpinDelayStatus delayStatus;

+ unextend();
+
init_spin_delay(&delayStatus, file, line, func);

while (TAS_SPIN(lock))
@@ -102,6 +105,7 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
}

finish_spin_delay(&delayStatus);
+ extend();

return delayStatus.delays;
}
diff --git a/src/include/storage/rseq-abi.h b/src/include/storage/rseq-abi.h
new file mode 100644
index 0000000000..b858cf1d6f
--- /dev/null
+++ b/src/include/storage/rseq-abi.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _RSEQ_ABI_H
+#define _RSEQ_ABI_H
+
+/*
+ * rseq-abi.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2022 Mathieu Desnoyers <[email protected]>
+ */
+
+#include <ctype.h>
+#include <asm/types.h>
+
+enum rseq_abi_cpu_id_state {
+ RSEQ_ABI_CPU_ID_UNINITIALIZED = -1,
+ RSEQ_ABI_CPU_ID_REGISTRATION_FAILED = -2,
+};
+
+enum rseq_abi_flags {
+ RSEQ_ABI_FLAG_UNREGISTER = (1 << 0),
+};
+
+enum rseq_abi_cs_flags_bit {
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+};
+
+enum rseq_abi_cs_flags {
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+};
+
+/*
+ * struct rseq_abi_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_abi_cs {
+ /* Version of this structure. */
+ __u32 version;
+ /* enum rseq_abi_cs_flags */
+ __u32 flags;
+ __u64 start_ip;
+ /* Offset from start_ip. */
+ __u64 post_commit_offset;
+ __u64 abort_ip;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+/*
+ * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ *
+ * A single struct rseq_abi per thread is allowed.
+ */
+struct rseq_abi {
+ /*
+ * Restartable sequences cpu_id_start field. Updated by the
+ * kernel. Read by user-space with single-copy atomicity
+ * semantics. This field should only be read by the thread which
+ * registered this data structure. Aligned on 32-bit. Always
+ * contains a value in the range of possible CPUs, although the
+ * value may not be the actual current CPU (e.g. if rseq is not
+ * initialized). This CPU number value should always be compared
+ * against the value of the cpu_id field before performing a rseq
+ * commit or returning a value read from a data structure indexed
+ * using the cpu_id_start value.
+ */
+ __u32 cpu_id_start;
+ /*
+ * Restartable sequences cpu_id field. Updated by the kernel.
+ * Read by user-space with single-copy atomicity semantics. This
+ * field should only be read by the thread which registered this
+ * data structure. Aligned on 32-bit. Values
+ * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED
+ * have a special semantic: the former means "rseq uninitialized",
+ * and latter means "rseq initialization failed". This value is
+ * meant to be read within rseq critical sections and compared
+ * with the cpu_id_start value previously read, before performing
+ * the commit instruction, or read and compared with the
+ * cpu_id_start value before returning a value loaded from a data
+ * structure indexed using the cpu_id_start value.
+ */
+ __u32 cpu_id;
+ /*
+ * Restartable sequences rseq_cs field.
+ *
+ * Contains NULL when no critical section is active for the current
+ * thread, or holds a pointer to the currently active struct rseq_cs.
+ *
+ * Updated by user-space, which sets the address of the currently
+ * active rseq_cs at the beginning of assembly instruction sequence
+ * block, and set to NULL by the kernel when it restarts an assembly
+ * instruction sequence block, as well as when the kernel detects that
+ * it is preempting or delivering a signal outside of the range
+ * targeted by the rseq_cs. Also needs to be set to NULL by user-space
+ * before reclaiming memory that contains the targeted struct rseq_cs.
+ *
+ * Read and set by the kernel. Set by user-space with single-copy
+ * atomicity semantics. This field should only be updated by the
+ * thread which registered this data structure. Aligned on 64-bit.
+ */
+ union {
+ __u64 ptr64;
+
+ /*
+ * The "arch" field provides architecture accessor for
+ * the ptr field based on architecture pointer size and
+ * endianness.
+ */
+ struct {
+#ifdef __LP64__
+ __u64 ptr;
+#elif defined(__BYTE_ORDER) ? (__BYTE_ORDER == __BIG_ENDIAN) : defined(__BIG_ENDIAN)
+ __u32 padding; /* Initialized to zero. */
+ __u32 ptr;
+#else
+ __u32 ptr;
+ __u32 padding; /* Initialized to zero. */
+#endif
+ } arch;
+ } rseq_cs;
+
+ /*
+ * Restartable sequences flags field.
+ *
+ * This field should only be updated by the thread which
+ * registered this data structure. Read by the kernel.
+ * Mainly used for single-stepping through rseq critical sections
+ * with debuggers.
+ *
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * Inhibit instruction sequence block restart on preemption
+ * for this thread.
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * Inhibit instruction sequence block restart on signal
+ * delivery for this thread.
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE
+ * Inhibit instruction sequence block restart on migration for
+ * this thread.
+ */
+ __u32 flags;
+
+ /*
+ * Restartable sequences node_id field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current NUMA node ID.
+ */
+ __u32 node_id;
+
+ /*
+ * Restartable sequences mm_cid field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current thread's concurrency ID
+ * (allocated uniquely within a memory map).
+ */
+ __u32 mm_cid;
+
+ __u32 cr_flags;
+ /*
+ * Flexible array member at end of structure, after last feature field.
+ */
+ char end[];
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+#endif /* _RSEQ_ABI_H */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c9fa84cc43..999a056c6f 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -96,6 +96,10 @@
#ifndef S_LOCK_H
#define S_LOCK_H

+#define __GNU_SOURCE
+#include <unistd.h>
+#include "storage/rseq-abi.h"
+
#ifdef FRONTEND
#error "s_lock.h may not be included from frontend code"
#endif
@@ -131,6 +135,53 @@
*----------
*/

+extern ptrdiff_t __rseq_offset;
+extern unsigned int __rseq_size;
+
+static inline unsigned clear_extend(volatile unsigned *ptr)
+{
+ unsigned ret;
+
+ asm volatile("andb %b1,%0"
+ : "+m" (*(volatile char *)ptr)
+ : "iq" (~0x1)
+ : "memory");
+
+ ret = *ptr;
+ *ptr = 0;
+
+ return ret;
+}
+
+static inline void extend(void)
+{
+ struct rseq_abi *rseq_ptr;
+
+ if (!__rseq_size)
+ return;
+
+ rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset);
+ rseq_ptr->cr_flags = 1;
+}
+
+static inline void unextend(void)
+{
+ struct rseq_abi *rseq_ptr;
+ unsigned prev;
+
+ if (!__rseq_size)
+ return;
+
+ rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset);
+
+ prev = clear_extend(&rseq_ptr->cr_flags);
+ if (prev & 2) {
+ gettid();
+ }
+}
+
+#define __S_UNLOCK(lock) do { S_UNLOCK(lock); unextend(); } while (0)
+#define __S_LOCK(lock) do { extend(); S_LOCK(lock); } while (0)

#ifdef __i386__ /* 32-bit i386 */
#define HAS_TEST_AND_SET
diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index 5d809cc980..4230025748 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -59,9 +59,9 @@

#define SpinLockInit(lock) S_INIT_LOCK(lock)

-#define SpinLockAcquire(lock) S_LOCK(lock)
+#define SpinLockAcquire(lock) __S_LOCK(lock)

-#define SpinLockRelease(lock) S_UNLOCK(lock)
+#define SpinLockRelease(lock) __S_UNLOCK(lock)

#define SpinLockFree(lock) S_LOCK_FREE(lock)

2023-10-30 12:56:55

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-27 12:49, Steven Rostedt wrote:
> On Fri, 27 Oct 2023 12:35:56 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
>>> Does that make more sense?
>>
>> Not really.
>>
>> Please see my other email about the need for a reference count here, for
>> nested locks use-cases.
>
> Note, my original implementation of nested locking was done completely in
> user space.
>
> int __thread lock_cnt;
>
> extend() {
> if (lock_cnt++)
> return;
> ...
> }
>
> unextend() {
> if (--lock_cnt)
> return;
> ...
> }

This only works if "your" lock implementation is the only user of this
RSEQ feature within a process. RSEQ requires that multiple libraries can
share the facilities. Therefore, the rseq field should include the
nesting counter as part of the RSEQ ABI so various userspace libraries
can use it collaboratively.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-30 13:45:56

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Mon, 30 Oct 2023 08:56:50 -0400
Mathieu Desnoyers <[email protected]> wrote:
>
> This only works if "your" lock implementation is the only user of this
> RSEQ feature within a process. RSEQ requires that multiple libraries can
> share the facilities. Therefore, the rseq field should include the
> nesting counter as part of the RSEQ ABI so various userspace libraries
> can use it collaboratively.
>

Then I would suggest allowing bit 31 to be an "on/off" switch, and the lower
bits to be a counter. When I first tried this with postgres, there was one
lwlock that looked to be held for the majority of the run, so I got rid of
the nesting. But I think a mixture of both would work, where you can have a
nesting counter and an on/off switch, with the caveat that if you use the
switch and enable it, another library may disable it.

-- Steve

2023-10-30 18:05:05

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-30 09:45, Steven Rostedt wrote:
> On Mon, 30 Oct 2023 08:56:50 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>>
>> This only works if "your" lock implementation is the only user of this
>> RSEQ feature within a process. RSEQ requires that multiple libraries can
>> share the facilities. Therefore, the rseq field should include the
>> nesting counter as part of the RSEQ ABI so various userspace libraries
>> can use it collaboratively.
>>
>
> Then I would suggest allowing bit 31 be an "on/off" switch, and the lower
> bits to be a counter. When I first tried this with postgres, there was one
> lwlock that looked to be held for the majority of the run, so I got rid of
> the nesting. But I think a mixture of both would work, where you can have a
> nesting counter and an on/off switch with the caveat that if you use it and
> enable it, another library may disable it.

If you have the nesting counter, why do you need the explicit on/off
switch ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-30 18:20:13

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Mon, 30 Oct 2023 14:05:05 -0400
Mathieu Desnoyers <[email protected]> wrote:

> If you have the nesting counter, why do you need the explicit on/off
> switch ?

Because I gave up when I found that one of the lwlocks seemed to be held for
a long time (pretty much the entire test), or I couldn't find where it was
unlocked (the code isn't trivial).

So I just made every unlock disable the extended time slot. I need to go
back and enable both a counter and an on/off switch, as I now realize that
the spin locks (called within the lwlock) will disable the extended time
before the lwlock is released. This should work if I have the spinlocks inc
and dec (they are straightforward and all locks are associated with an easily
found unlock), and have the lwlock use bit 31 as an on/off switch.

Anyway, I would let user space decide what it wants to do: give it 31
bits to say "I'm extended" and let user space come up with how it handles
those 31 bits.
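
(A sketch of that mixed scheme, purely illustrative: bit 0 is the kernel's
yield hint, bits 1..30 a nesting counter for the spinlocks, and bit 31 a
coarse on/off switch for the lwlocks. The kernel side only checks that
cr_flags & ~1 is non-zero, and as before the real updates would need
single-instruction care:)

#include <sched.h>

#define CR_YIELD	(1U << 0)	/* set by the kernel */
#define CR_SWITCH	(1U << 31)	/* coarse switch for long-held lwlocks */

static __thread volatile unsigned *cr_flags;	/* points into this thread's rseq area */

static void spin_extend(void)    { *cr_flags += 0x2; }		/* counter in bits 1..30 */
static void lwlock_extend(void)  { *cr_flags |= CR_SWITCH; }

static void maybe_yield(void)
{
	/* Only act on the hint once nothing marks us "extended" anymore. */
	if (*cr_flags == CR_YIELD) {
		*cr_flags = 0;
		sched_yield();
	}
}

static void spin_unextend(void)   { *cr_flags -= 0x2; maybe_yield(); }
static void lwlock_unextend(void) { *cr_flags &= ~CR_SWITCH; maybe_yield(); }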

-- Steve

2023-10-30 18:27:10

by Mathieu Desnoyers

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On 2023-10-30 14:19, Steven Rostedt wrote:
> On Mon, 30 Oct 2023 14:05:05 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
>> If you have the nesting counter, why do you need the explicit on/off
>> switch ?
>
> Because I gave up when I found that one of the lwlocks seemed to take a long
> time (pretty much the entire test) or I couldn't find how it was unlocked
> (the code isn't trivial).
>
> So I just made every unlock disable the extended time slot. I need to go
> back and enable both a counter and an on/off as I now realize that the spin
> locks (called within the lwlock) will disable the extend time before the
> lwlock is released. This should work if I have the spinlocks inc and dec
> (they are straight forward and all locks are associated with an easily
> found unlock), and have the lwlock use bit 31 as an on/off switch.

This extra on/off switch appears to be working around userspace issues.

> Anyway, I would let user space decide what it wants to do, and giving it 31
> bits to say "I'm extended" and let user space come up with how it handles
> those 31 bits.

If this makes it into the RSEQ uapi, RSEQ should state how userspace
should collaborate wrt those bits (e.g. nesting counter protocol), even
though it's not a kernel ABI per se. Otherwise we'll just push this to
libc to specify this, which is odd.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-30 18:39:23

by Steven Rostedt

Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

On Mon, 30 Oct 2023 14:27:10 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > So I just made every unlock disable the extended time slot. I need to go
> > back and enable both a counter and an on/off as I now realize that the spin
> > locks (called within the lwlock) will disable the extend time before the
> > lwlock is released. This should work if I have the spinlocks inc and dec
> > (they are straight forward and all locks are associated with an easily
> > found unlock), and have the lwlock use bit 31 as an on/off switch.
>
> This extra on/off switch appears to be working around userspace issues.

Yep!

But that doesn't mean there's not a legitimate use case for it. I don't
want to limit the feature because of that. It's unlikely bit 31 would ever be
reached by a counter anyway, so it could be used as an on/off switch the
same way the NEED_RESCHED bit is used as an on/off switch in preempt_count
in the kernel.

>
> > Anyway, I would let user space decide what it wants to do, and giving it 31
> > bits to say "I'm extended" and let user space come up with how it handles
> > those 31 bits.
>
> If this makes it into the RSEQ uapi, RSEQ should state how userspace
> should collaborate wrt those bits (e.g. nesting counter protocol), even
> though it's not a kernel ABI per se. Otherwise we'll just push this to
> libc to specify this, which is odd.

I agree that user space should have the usage specified. Hell, that bit
could just be used for testing purposes. I think having it reserved is
better than not specifying it and limiting its usage later.

-- Steve