2020-08-06 00:11:12

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

This patchset is based on Google-internal RSEQ
work done by Paul Turner and Andrew Hunter.

When working with per-CPU RSEQ-based memory allocations,
it is sometimes important to make sure that a global
memory location is no longer accessed from RSEQ critical
sections. For example, there can be two per-CPU lists,
one is "active" and accessed per-CPU, while another one
is inactive and worked on asynchronously "off CPU" (e.g.
garbage collection is performed). Then at some point
the two lists are swapped, and a fast RCU-like mechanism
is required to make sure that the previously active
list is no longer accessed.

This patch introduces such a mechanism: in short,
membarrier() syscall issues an IPI to a CPU, restarting
any potentially active RSEQ critical sections on the CPU,
with an option to restart RSEQ CSs on all CPUs.

The second patch in the patchset adds a selftest
of this feature.

Signed-off-by: Peter Oskolkov <[email protected]>
---
include/uapi/linux/membarrier.h | 8 ++++++
kernel/sched/membarrier.c | 51 +++++++++++++++++++++++++++++++--
2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 5891d7614c8c..169ffb613397 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -114,6 +114,13 @@
* If this command is not implemented by an
* architecture, -EINVAL is returned.
* Returns 0 on success.
+ * @MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU:
+ * If a thread belonging to the current process
+ * is currently in an RSEQ critical section on the
+ * CPU identified by flags parameter, restart it.
+ * @flags: if @flags >= 0, identifies the CPU to
+ * restart RSEQ CS on; if == -1, restarts
+ * RSEQ CSs on all CPUs.
* @MEMBARRIER_CMD_SHARED:
* Alias to MEMBARRIER_CMD_GLOBAL. Provided for
* header backward compatibility.
@@ -131,6 +138,7 @@ enum membarrier_cmd {
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 5),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 6),
+ MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU = (1 << 7),

/* Alias for header backward compatibility. */
MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..edcc1386daf5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -18,11 +18,19 @@
#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK 0
#endif

+#ifdef CONFIG_RSEQ
+#define MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU_BITMASK \
+ MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU
+#else
+#define MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU_BITMASK 0
+#endif
+
#define MEMBARRIER_CMD_BITMASK \
(MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_PRIVATE_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED \
+ | MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU_BITMASK \
| MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK)

static void ipi_mb(void *info)
@@ -308,10 +316,47 @@ static int membarrier_register_private_expedited(int flags)
return 0;
}

+#ifdef CONFIG_RSEQ
+static void membarrier_rseq_ipi(void *arg)
+{
+ struct task_struct *leader = (struct task_struct *)arg;
+
+ if (current->group_leader != leader) /* Not our process. */
+ return;
+ if (!current->rseq) /* RSEQ not set up for current task/thread. */
+ return;
+
+ rseq_preempt(current);
+}
+#endif
+
+static int membarrier_private_restart_rseq_on_cpu(int cpu_id)
+{
+#ifdef CONFIG_RSEQ
+ /* syscalls are not allowed inside rseq critical sections. */
+ if (cpu_id == raw_smp_processor_id())
+ return 0;
+
+ if (cpu_id >= 0) {
+ return smp_call_function_single(cpu_id, membarrier_rseq_ipi,
+ current->group_leader, true);
+ } else if (cpu_id == -1) {
+ on_each_cpu(membarrier_rseq_ipi,
+ current->group_leader, true);
+ } else {
+ return -EINVAL;
+ }
+#endif
+ return 0;
+}
+
/**
* sys_membarrier - issue memory barriers on a set of threads
* @cmd: Takes command values defined in enum membarrier_cmd.
- * @flags: Currently needs to be 0. For future extensions.
+ * @flags: Currently needs to be 0 for all commands other than
+ * MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU: in the latter
+ * case it indicates the CPU on which to interrupt (= restart)
+ * the RSEQ critical section, or -1 to restart RSEQ CSs on all CPUs.
*
* If this system call is not implemented, -ENOSYS is returned. If the
* command specified does not exist, not available on the running
@@ -339,7 +384,7 @@ static int membarrier_register_private_expedited(int flags)
*/
SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
{
- if (unlikely(flags))
+ if (unlikely(flags && cmd != MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU))
return -EINVAL;
switch (cmd) {
case MEMBARRIER_CMD_QUERY:
@@ -369,6 +414,8 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
return membarrier_register_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+ case MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU:
+ return membarrier_private_restart_rseq_on_cpu(flags);
default:
return -EINVAL;
}
--
2.28.0.163.g6104cc2f0b6-goog


2020-08-06 17:11:30

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

On Thu, Aug 6, 2020 at 6:48 AM <[email protected]> wrote:
>
> On Wed, Aug 05, 2020 at 05:08:58PM -0700, Peter Oskolkov wrote:
>
> Thanks for the Cc!

Always a pleasure!

(Sorry, included only membarrier maintainers in v1; in v2 included
both membarrier and rseq maintainers).

>
> > + * @MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU:
> > + * If a thread belonging to the current process
> > + * is currently in an RSEQ critical section on the
> > + * CPU identified by flags parameter, restart it.
> > + * @flags: if @flags >= 0, identifies the CPU to
> > + * restart RSEQ CS on; if == -1, restarts
> > + * RSEQ CSs on all CPUs.
>
> > + } else if (cpu_id == -1) {
> > + on_each_cpu(membarrier_rseq_ipi,
> > + current->group_leader, true);
>
> This is an unpriv IPI the world. That's a big no-no.

removed in v2.

>
> Double so because all you want to target is the current process, which
> you're defining as CLONE_THREAD, where the rest of this file uses
> CLONE_VM to define a process.

Use current->mm in v2 instead of current->group_leader. Is it better this way?

2020-08-06 17:28:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

On Wed, Aug 05, 2020 at 05:08:58PM -0700, Peter Oskolkov wrote:

Thanks for the Cc!

> + * @MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU:
> + * If a thread belonging to the current process
> + * is currently in an RSEQ critical section on the
> + * CPU identified by flags parameter, restart it.
> + * @flags: if @flags >= 0, identifies the CPU to
> + * restart RSEQ CS on; if == -1, restarts
> + * RSEQ CSs on all CPUs.

> + } else if (cpu_id == -1) {
> + on_each_cpu(membarrier_rseq_ipi,
> + current->group_leader, true);

This is an unpriv IPI the world. That's a big no-no.

Double so because all you want to target is the current process, which
you're defining as CLONE_THREAD, where the rest of this file uses
CLONE_VM to define a process.

2020-08-06 17:40:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

----- On Aug 6, 2020, at 1:07 PM, Peter Oskolkov [email protected] wrote:

> On Thu, Aug 6, 2020 at 6:48 AM <[email protected]> wrote:
>>
>> On Wed, Aug 05, 2020 at 05:08:58PM -0700, Peter Oskolkov wrote:
>>
>> Thanks for the Cc!
>
> Always a pleasure!
>
> (Sorry, included only membarrier maintainers in v1; in v2 included
> both membarrier and rseq maintainers).
>
>>
>> > + * @MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU:
>> > + * If a thread belonging to the current process
>> > + * is currently in an RSEQ critical section on the
>> > + * CPU identified by flags parameter, restart it.
>> > + * @flags: if @flags >= 0, identifies the CPU to
>> > + * restart RSEQ CS on; if == -1, restarts
>> > + * RSEQ CSs on all CPUs.
>>
>> > + } else if (cpu_id == -1) {
>> > + on_each_cpu(membarrier_rseq_ipi,
>> > + current->group_leader, true);
>>
>> This is an unpriv IPI the world. That's a big no-no.
>
> removed in v2.

I don't think the feature must be removed, but its implementation needs adjustment.

How about we simply piggy-back on the membarrier schemes we already have, and
implement:

membarrier_register_private_expedited(MEMBARRIER_FLAG_RSEQ)
membarrier_private_expedited(MEMBARRIER_FLAG_RSEQ)

All the logic is there to prevent sending IPIs to runqueues which are not running
threads associated with the same mm. Considering that preemption does an rseq abort,
running a thread belonging to a different mm should mean that this CPU is not
currently executing an rseq critical section, or if it was, it has already been
aborted, so it is quiescent.

Then you'll probably want to change membarrier_private_expedited so it takes an
extra "cpu" argument. If cpu=-1, iterate on all runqueues like we currently do.
If cpu >= 0, only IPI that CPU if the thread currently running has the same mm.

Also, should this belong to the membarrier or the rseq system call ? It just
looks like the membarrier happens to implement very similar things for barriers,
but arguably this is really about rseq. I wonder if we should expose this through
rseq instead, even if we end up using membarrier code.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-08-07 17:49:52

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

On Thu, Aug 6, 2020 at 10:37 AM Mathieu Desnoyers
<[email protected]> wrote:
>

> >>
> >> This is an unpriv IPI the world. That's a big no-no.
> >
> > removed in v2.
>
> I don't think the feature must be removed, but its implementation needs adjustment.
>
> How about we simply piggy-back on the membarrier schemes we already have, and
> implement:
>
> membarrier_register_private_expedited(MEMBARRIER_FLAG_RSEQ)
> membarrier_private_expedited(MEMBARRIER_FLAG_RSEQ)
>
> All the logic is there to prevent sending IPIs to runqueues which are not running
> threads associated with the same mm. Considering that preemption does an rseq abort,
> running a thread belonging to a different mm should mean that this CPU is not
> currently executing an rseq critical section, or if it was, it has already been
> aborted, so it is quiescent.
>
> Then you'll probably want to change membarrier_private_expedited so it takes an
> extra "cpu" argument. If cpu=-1, iterate on all runqueues like we currently do.
> If cpu >= 0, only IPI that CPU if the thread currently running has the same mm.
>

Thanks, Mathieu! I'll prepare something based on your and Peter's feedback.

> Also, should this belong to the membarrier or the rseq system call ? It just
> looks like the membarrier happens to implement very similar things for barriers,
> but arguably this is really about rseq. I wonder if we should expose this through
> rseq instead, even if we end up using membarrier code.

Yes, this is more about rseq; on the other hand, the high-level API/behavior
looks closer to that membarrier, and a lot of code will be shared.

As you are the maintainer for both rseq and membarrier, this is for
you to decide, I guess... :)

2020-08-07 18:09:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

----- On Aug 7, 2020, at 1:48 PM, Peter Oskolkov [email protected] wrote:

> On Thu, Aug 6, 2020 at 10:37 AM Mathieu Desnoyers
> <[email protected]> wrote:
>>
[...]
>> Also, should this belong to the membarrier or the rseq system call ? It just
>> looks like the membarrier happens to implement very similar things for barriers,
>> but arguably this is really about rseq. I wonder if we should expose this
>> through
>> rseq instead, even if we end up using membarrier code.
>
> Yes, this is more about rseq; on the other hand, the high-level API/behavior
> looks closer to that membarrier, and a lot of code will be shared.
>
> As you are the maintainer for both rseq and membarrier, this is for
> you to decide, I guess... :)

Considering that membarrier has been made extensible with the cmd
argument, and on the other hand rseq can be extended with "flags", but is
currently only about registration/unregistration, I think adding a command
to membarrier is indeed a natural approach.

I am not very fond on re-purposing the membarrier flags parameter into a
cpu number though. Maybe we should tweak the membarrier system call so it
can expect 3 arguments instead ?

int membarrier(int cmd, int flags, int cpu);

where cpu is only used for specific commands.

One thing I find weird about Peter's patch is that it adds a
MEMBERRIER_CMD_PRIVATE_EXPEDITED_RSEQ without a corresponding
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ. Considering that
the SYNC_CORE variant already has its own register command, I
find it weird that the RSEQ counterpart does not have one.

Also, do we want to allow a RSEQ | SYNC_CORE private expedited
membarrier as well ? If that is the case, then we might want to
investigate exposing RSEQ-membarrier as a new membarrier flag
rather than as a stand-alone command.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-08-07 19:05:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] membarrier: add MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU

On Fri, Aug 07, 2020 at 02:07:59PM -0400, Mathieu Desnoyers wrote:
> One thing I find weird about Peter's patch is that it adds a
> MEMBERRIER_CMD_PRIVATE_EXPEDITED_RSEQ without a corresponding
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ. Considering that
> the SYNC_CORE variant already has its own register command, I
> find it weird that the RSEQ counterpart does not have one.

I thought the register thing was about global, not private membarriers.

Anyway, it was just a quick pseudo thing to show how one can go about
adding the rseq to the mm iteration we already have.