On Oct 27, 2015 4:56 PM, "Paul Turner" <[email protected]> wrote:
>
> This is an update to the previously posted series at:
> https://lkml.org/lkml/2015/6/24/665
>
> Dave Watson has posted a similar follow-up which allows additional critical
> regions to be registered as well as single-step support at:
> https://lkml.org/lkml/2015/10/22/588
>
> This series is a new approach which introduces an alternate ABI that does not
> depend on open-coded assembly nor a central 'repository' of rseq sequences.
> Sequences may now be inlined and the preparatory[*] work for the sequence can
> be written in a higher level language.
>
> This new ABI has also been written to support debugger interaction in a way
> that the previous ABI could not.
>
> [*] A sequence essentially has 3 steps:
> 1) Determine which cpu the sequence is being run on
> 2) Preparatory work specific to the state read in 1)
> 3) A single commit instruction which finalizes any state updates.
After much delay, I have a few questions:
What are the useful commit operations? IIUC they probably need to be
single instructions, which makes it sound like they're all either RMW
or store operations. I think that plain stores are sufficient to
emulate RMW since (1) if the value changes out from under you, you'll
abort, and (2) most CPUs don't have single instruction RMW anyway.
We could probably speed up commits and make the code a bit more
obvious by feeding the kernel a pointer to a descriptor instead of
feeding it individual values. For example, the descriptor could be:
struct commit_descriptor {
unsigned long done_instruction;
unsigned long failure_instruction;
// maybe cpu and event number are in here
// maybe *pointer* to event number is in here
};
Is your scheme safe against signals that are delivered during commit?
Would the signal context need to save and zero the commit state?
I still want to see if we can get away from the kernel-managed event
counter. Would the following work:
start_sequence:
read current CPU number
change event counter
re-read current CPU number [1]
commit:
tell kernel we're committing
re-check event counter and CPU number
do the commit instruction
tell kernel we're done committing
[1] avoids a memory ordering issue if we migrate before changing the
event counter
The kernel forces an abort if, on resume from any kernel entry, the
CPU number or event counter is wrong.
If this worked, then it would be inherently debuggable, since the only
way that it would abort in a single-threaded situation is if it
migrated during commit.
--Andy
>
> We require a single instruction for (3) so that if it is interrupted in any
> way, we can proceed from (1) once more [or potentially bail].
>
> This new ABI can be described as:
> Progress is ordered as follows:
> *0. Userspace stores current event+cpu counter values
> 1. Userspace loads the rip to move to at failure into cx
> 2. Userspace loads the rip of the instruction following
> the critical section into a registered TLS address.
> 3. Userspace loads the values read at [0] into a known
> location.
> 4. Userspace tests to see whether the current event and
> cpu counter values match those stored at 0. Manually
> jumping to the address from [1] in the case of a
> mismatch.
>
> Note that if we are preempted or otherwise interrupted
> then the kernel can also now perform this comparison
> and conditionally jump us to [1].
> 5. Our final instruction before [2] is then our commit.
> The critical section is self-terminating. [2] must
> also be cleared at this point.
>
> For x86_64:
> [3] uses rdx to represent cpu and event counter as a
> single 64-bit value.
>
> For i386:
> [3] uses ax for cpu and dx for the event_counter.
>
> Both:
> Instruction after commit: rseq_state->post_commit_instr
> Current event and cpu state: rseq_state->event_and_cpu
>
> Exactly, for x86_64 this looks like:
> movq <failed>, rcx [1]
> movq $1f, <commit_instr> [2]
> cmpq <start value>, <current value> [3] (start is in rcx)
> jnz <failed> (4)
> movq <to_write>, (<target>) (5)
> 1: movq $0, <commit_instr>
>
> There has been some related discussion, which I am supportive of, in which
> we use fs/gs instead of TLS. This maps naturally to the above and removes
> the current requirement for per-thread initialization (this is a good thing!).
>
> On debugger interactions:
>
> There are some nice properties about this new style of API which allow it to
> actually support safe interactions with a debugger:
> a) The event counter is a per-cpu value. This means that we can not advance
> it if no threads from the same process execute on that cpu. This
> naturally allows basic single step support with thread-isolation.
> b) Single-step can be augmented to evalute the ABI without incrementing the
> event count.
> c) A debugger can also be augmented to evaluate this ABI and push restarts
> on the kernel's behalf.
>
> This is also compatible with David's approach of not single stepping between
> 2-4 above. However, I think these are ultimately a little stronger since true
> single-stepping and breakpoint support would be available. Which would be
> nice to allow actual debugging of sequences.
>
> (Note that I haven't bothered implementing these in the current patchset as we
> are still winnowing down on the ABI and they just add complexity. It's
> important to note that they are possible however.)
>
> Thanks,
>
> - Paul
>
>
On Wed, Apr 06, 2016 at 08:56:33AM -0700, Andy Lutomirski wrote:
> What are the useful commit operations? IIUC they probably need to be
> single instructions, which makes it sound like they're all either RMW
> or store operations. I think that plain stores are sufficient to
> emulate RMW since (1) if the value changes out from under you, you'll
> abort, and (2) most CPUs don't have single instruction RMW anyway.
Yeah so stores are sufficient. The only requirement is that the store is
a single operation/instruction. So you can typically not commit anything
wider than the native word size.
> We could probably speed up commits and make the code a bit more
> obvious by feeding the kernel a pointer to a descriptor instead of
> feeding it individual values. For example, the descriptor could be:
See the Thread-Local-ABI effort by Mathieu, the idea is to get a single
full cacheline (64bytes) fixed size thingy allocated at a fixed offset
to the TCB.
That way we can reduce to %[gf]s:offset for these here variables (I
forever forget which segment register userspace uses for TLS).
> Is your scheme safe against signals that are delivered during commit?
> Would the signal context need to save and zero the commit state?
The patches you comment on here explicitly update the event from the
signal frame setup and thereby handle this.
The update not only increments the sequence count, but also tests the
post_commit_ip thing, if set it assigns fail value to regs->ip (ie the
commit fails).
tl;dr, yes its safe against signals happening during commit.
> I still want to see if we can get away from the kernel-managed event
> counter. Would the following work:
>
> start_sequence:
> read current CPU number
> change event counter
> re-read current CPU number [1]
>
> commit:
> tell kernel we're committing
> re-check event counter and CPU number
> do the commit instruction
> tell kernel we're done committing
>
> [1] avoids a memory ordering issue if we migrate before changing the
> event counter
So currently the event and cpu form a single 64bit value, so that on
64bit we can use a single load and cmp to verify them.
So letting the user manage the event is doable, but it would still be
advisable to have the event in the same shared word.
> The kernel forces an abort if, on resume from any kernel entry, the
> CPU number or event counter is wrong.
>
> If this worked, then it would be inherently debuggable, since the only
> way that it would abort in a single-threaded situation is if it
> migrated during commit.
(or a signal happened, as per the below)
Tempting.. not signal safe though, although I suppose we can still
explicitly do the:
if (regs->ip < post_commit_ip)
regs->ip = regs->rcx;
thing on signal frame setup to abort any in-flight commit without
explicitly incrementing the sequence number.
Not having to mange the event count from kernel reduces the kernel work
to migration only -- ie. we can get rid of the preemption hook and
reduce overhead there, something I'm entirely happy with if possible.
On Apr 7, 2016 5:02 AM, "Peter Zijlstra" <[email protected]> wrote:
>
> On Wed, Apr 06, 2016 at 08:56:33AM -0700, Andy Lutomirski wrote:
> > What are the useful commit operations? IIUC they probably need to be
> > single instructions, which makes it sound like they're all either RMW
> > or store operations. I think that plain stores are sufficient to
> > emulate RMW since (1) if the value changes out from under you, you'll
> > abort, and (2) most CPUs don't have single instruction RMW anyway.
>
> Yeah so stores are sufficient. The only requirement is that the store is
> a single operation/instruction. So you can typically not commit anything
> wider than the native word size.
>
> > We could probably speed up commits and make the code a bit more
> > obvious by feeding the kernel a pointer to a descriptor instead of
> > feeding it individual values. For example, the descriptor could be:
>
> See the Thread-Local-ABI effort by Mathieu, the idea is to get a single
> full cacheline (64bytes) fixed size thingy allocated at a fixed offset
> to the TCB.
>
> That way we can reduce to %[gf]s:offset for these here variables (I
> forever forget which segment register userspace uses for TLS).
What I meant was: rather than shoving individual values into the TLABI
thing, shove in a pointer:
struct commit_info {
u64 post_commit_rip;
u32 cpu;
u64 *event;
// whatever else;
};
and then put a commit_info* in TLABI.
This would save some bytes in the TLABI structure.
>
> > Is your scheme safe against signals that are delivered during commit?
> > Would the signal context need to save and zero the commit state?
>
> The patches you comment on here explicitly update the event from the
> signal frame setup and thereby handle this.
>
> The update not only increments the sequence count, but also tests the
> post_commit_ip thing, if set it assigns fail value to regs->ip (ie the
> commit fails).
>
> tl;dr, yes its safe against signals happening during commit.
That's not quite what I meant -- see below.
>
> > I still want to see if we can get away from the kernel-managed event
> > counter. Would the following work:
> >
> > start_sequence:
> > read current CPU number
> > change event counter
> > re-read current CPU number [1]
> >
> > commit:
> > tell kernel we're committing
> > re-check event counter and CPU number
> > do the commit instruction
> > tell kernel we're done committing
> >
> > [1] avoids a memory ordering issue if we migrate before changing the
> > event counter
>
> So currently the event and cpu form a single 64bit value, so that on
> 64bit we can use a single load and cmp to verify them.
>
> So letting the user manage the event is doable, but it would still be
> advisable to have the event in the same shared word.
>
Why is a single load needed? The event and CPU in the TLABI structure
are only ever accessed from the thread in question.
> > The kernel forces an abort if, on resume from any kernel entry, the
> > CPU number or event counter is wrong.
> >
> > If this worked, then it would be inherently debuggable, since the only
> > way that it would abort in a single-threaded situation is if it
> > migrated during commit.
>
> (or a signal happened, as per the below)
>
> Tempting.. not signal safe though, although I suppose we can still
> explicitly do the:
>
> if (regs->ip < post_commit_ip)
> regs->ip = regs->rcx;
>
> thing on signal frame setup to abort any in-flight commit without
> explicitly incrementing the sequence number.
What I meant was: what if we did it a bit differently? On signal
delivery, we could stash post_commit_ip, etc in the signal context and
then zero them out. On sigreturn, we would restore them from the
signal frame back into TLABI and then, before resuming userspace, we'd
check for an abort.
That way we could take an async signal, handle it, and resume, even in
the middle of a commit, without aborting. Of course, if the signal
hander tried to access the same rseq-protected resource, it would bump
the event counter and cause an abort.
I think it's safe in the sense that a signal will abort a commit
operation. I'm wondering if we can do better: what if a signal could
avoid interfering with commit unless it would actually conflict?s food
for thought: what if, instead of aborting
--Andy
On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> What I meant was: rather than shoving individual values into the TLABI
> thing, shove in a pointer:
>
> struct commit_info {
> u64 post_commit_rip;
> u32 cpu;
> u64 *event;
> // whatever else;
> };
>
> and then put a commit_info* in TLABI.
>
> This would save some bytes in the TLABI structure.
But would cost us extra indirections. The whole point was getting this
stuff at a constant offset from the TLS segment register.
> > So letting the user manage the event is doable, but it would still be
> > advisable to have the event in the same shared word.
> >
>
> Why is a single load needed? The event and CPU in the TLABI structure
> are only ever accessed from the thread in question.
Its not required, but being able to do a single load saves on cycles,
less cycles is more good.
> That way we could take an async signal, handle it, and resume, even in
> the middle of a commit, without aborting. Of course, if the signal
> hander tried to access the same rseq-protected resource, it would bump
> the event counter and cause an abort.
Ah, so what happens if the signal happens before the commit but after
the load of the seqcount?
Then, even if the signal motifies the count, we'll not observe.
On Thu, Apr 07, 2016 at 05:24:32PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> > That way we could take an async signal, handle it, and resume, even in
> > the middle of a commit, without aborting. Of course, if the signal
> > hander tried to access the same rseq-protected resource, it would bump
> > the event counter and cause an abort.
>
> Ah, so what happens if the signal happens before the commit but after
> the load of the seqcount?
>
> Then, even if the signal motifies the count, we'll not observe.
Ah, and the same is true for preemptions. Which is why all this was
preemption driven, and not migration driven.
On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>> What I meant was: rather than shoving individual values into the TLABI
>> thing, shove in a pointer:
>>
>> struct commit_info {
>> u64 post_commit_rip;
>> u32 cpu;
>> u64 *event;
>> // whatever else;
>> };
>>
>> and then put a commit_info* in TLABI.
>>
>> This would save some bytes in the TLABI structure.
>
> But would cost us extra indirections. The whole point was getting this
> stuff at a constant offset from the TLS segment register.
I don't think the extra indirections would matter much. The kernel
would have to chase the pointer, but only in the very rare case where
it resumes userspace during a commit or on the immediately following
instruction.
At the very least, post_commit_rip and the abort address (which I
forgot about) could both live in a static structure, and shoving a
pointer to *that* into TLABI space is one store instead of two.
Admittedly, this is just a minor tweak either way. I'm just wondering
an extra indirect variant ends up being slightly faster.
>
>> > So letting the user manage the event is doable, but it would still be
>> > advisable to have the event in the same shared word.
>> >
>>
>> Why is a single load needed? The event and CPU in the TLABI structure
>> are only ever accessed from the thread in question.
>
> Its not required, but being able to do a single load saves on cycles,
> less cycles is more good.
>
>> That way we could take an async signal, handle it, and resume, even in
>> the middle of a commit, without aborting. Of course, if the signal
>> hander tried to access the same rseq-protected resource, it would bump
>> the event counter and cause an abort.
>
> Ah, so what happens if the signal happens before the commit but after
> the load of the seqcount?
>
> Then, even if the signal motifies the count, we'll not observe.
>
Where exactly?
In my scheme, nothing except the kernel ever loads the seqcount. The
user code generates a fresh value, writes it to memory, and then, just
before commit, writes that same value to the TLABI area and then
double-checks that the value it wrote at the beginning is still there.
If the signal modifies the count, then the user code won't directly
notice, but prepare_exit_to_usermode on the way out of the signal will
notice that the (restored) TLABI state doesn't match the counter that
the signal handler changed and will just to the abort address.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
> >> What I meant was: rather than shoving individual values into the TLABI
> >> thing, shove in a pointer:
> >>
> >> struct commit_info {
> >> u64 post_commit_rip;
> >> u32 cpu;
> >> u64 *event;
> >> // whatever else;
> >> };
> >>
> >> and then put a commit_info* in TLABI.
> >>
> >> This would save some bytes in the TLABI structure.
> >
> > But would cost us extra indirections. The whole point was getting this
> > stuff at a constant offset from the TLS segment register.
>
> I don't think the extra indirections would matter much. The kernel
> would have to chase the pointer, but only in the very rare case where
> it resumes userspace during a commit or on the immediately following
> instruction.
Its about userspace finding these values, not the kernel.
> At the very least, post_commit_rip and the abort address (which I
> forgot about) could both live in a static structure,
Paul keeps the abort address in rcx.
> and shoving a
> pointer to *that* into TLABI space is one store instead of two.
> > Ah, so what happens if the signal happens before the commit but after
> > the load of the seqcount?
> >
> > Then, even if the signal motifies the count, we'll not observe.
> >
>
> Where exactly?
>
> In my scheme, nothing except the kernel ever loads the seqcount. The
> user code generates a fresh value, writes it to memory, and then, just
> before commit, writes that same value to the TLABI area and then
> double-checks that the value it wrote at the beginning is still there.
>
> If the signal modifies the count, then the user code won't directly
> notice, but prepare_exit_to_usermode on the way out of the signal will
> notice that the (restored) TLABI state doesn't match the counter that
> the signal handler changed and will just to the abort address.
OK, you lost me.. commit looks like:
+ __asm__ __volatile__ goto (
+ "movq $%l[failed], %%rcx\n"
+ "movq $1f, %[commit_instr]\n"
+ "cmpq %[start_value], %[current_value]\n"
If we get preempted/signaled here without the preemption/signal entry
checking for the post_commit_instr, we'll fail hard.
+ "jnz %l[failed]\n"
+ "movq %[to_write], (%[target])\n"
+ "1: movq $0, %[commit_instr]\n"
+ : /* no outputs */
+ : [start_value]"d"(start_value.storage),
+ [current_value]"m"(__rseq_state),
+ [to_write]"r"(to_write),
+ [target]"r"(p),
+ [commit_instr]"m"(__rseq_state.post_commit_instr)
+ : "rcx", "memory"
+ : failed
+ );
On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>> >> What I meant was: rather than shoving individual values into the TLABI
>> >> thing, shove in a pointer:
>> >>
>> >> struct commit_info {
>> >> u64 post_commit_rip;
>> >> u32 cpu;
>> >> u64 *event;
>> >> // whatever else;
>> >> };
>> >>
>> >> and then put a commit_info* in TLABI.
>> >>
>> >> This would save some bytes in the TLABI structure.
>> >
>> > But would cost us extra indirections. The whole point was getting this
>> > stuff at a constant offset from the TLS segment register.
>>
>> I don't think the extra indirections would matter much. The kernel
>> would have to chase the pointer, but only in the very rare case where
>> it resumes userspace during a commit or on the immediately following
>> instruction.
>
> Its about userspace finding these values, not the kernel.
But userspace never reads post_commit_rip or the abort address AFAIK.
>
>> At the very least, post_commit_rip and the abort address (which I
>> forgot about) could both live in a static structure,
>
> Paul keeps the abort address in rcx.
So it's probably a wash. Paul's way is to load the abort address in
rcx (one instruction, no memory access) and to put post_commit_rip in
TLABI (one store). Mine is to put a pointer to a descriptor into
TLABI (one store). Should barely matter.
>
>> and shoving a
>> pointer to *that* into TLABI space is one store instead of two.
>
>> > Ah, so what happens if the signal happens before the commit but after
>> > the load of the seqcount?
>> >
>> > Then, even if the signal motifies the count, we'll not observe.
>> >
>>
>> Where exactly?
>>
>> In my scheme, nothing except the kernel ever loads the seqcount. The
>> user code generates a fresh value, writes it to memory, and then, just
>> before commit, writes that same value to the TLABI area and then
>> double-checks that the value it wrote at the beginning is still there.
>>
>> If the signal modifies the count, then the user code won't directly
>> notice, but prepare_exit_to_usermode on the way out of the signal will
>> notice that the (restored) TLABI state doesn't match the counter that
>> the signal handler changed and will just to the abort address.
>
>
> OK, you lost me.. commit looks like:
>
> + __asm__ __volatile__ goto (
> + "movq $%l[failed], %%rcx\n"
> + "movq $1f, %[commit_instr]\n"
> + "cmpq %[start_value], %[current_value]\n"
>
> If we get preempted/signaled here without the preemption/signal entry
> checking for the post_commit_instr, we'll fail hard.
Not if I'm thinking about this right. Because, in my scheme, we'd be
storing both the address of the counter [1] and the value that we
expect it to be (current_value? something's confused in the asm or
the variable naming, or I'm just not following exactly which variable
is which) in registers or TLABI, and the kernel would notice that
*counter != expected_counter when it resumes and would send us to the
abort label.
Presumably there would be one register hard-coded as the expected
counter value and another register hard-coded as the address of the
counter. That would avoid a bunch of stores to TLABI, and the kernel
could still find them. If the C optimizer did a good job, the
resulting code would be simple.
[1] In my scheme, there would be no reason for the counter to live in
TLABI. In contrast, there would be an excellent reason for it to
*not* live in TLABI -- user code could have more than one set of
counters, and they wouldn't interfere with each other. This would be
nice if one of them protects something that uses long enough code in
the critical sections that it's likely to be preempted.
>
> + "jnz %l[failed]\n"
> + "movq %[to_write], (%[target])\n"
> + "1: movq $0, %[commit_instr]\n"
> + : /* no outputs */
> + : [start_value]"d"(start_value.storage),
> + [current_value]"m"(__rseq_state),
> + [to_write]"r"(to_write),
> + [target]"r"(p),
> + [commit_instr]"m"(__rseq_state.post_commit_instr)
> + : "rcx", "memory"
> + : failed
> + );
More concretely, this looks like (using totally arbitrary register
assingments -- probably far from ideal, especially given how GCC's
constraints work):
enter the critical section:
1:
movq %[cpu], %%r12
movq {address of counter for our cpu}, %%r13
movq {some fresh value}, (%%r13)
cmpq %[cpu], %%r12
jne 1b
... do whatever setup or computation is needed...
movq $%l[failed], %%rcx
movq $1f, %[commit_instr]
cmpq {whatever counter we chose}, (%%r13)
jne %l[failed]
cmpq %[cpu], %%r12
jne %l[failed]
<-- a signal in here that conflicts with us would clobber (%%r13), and
the kernel would notice and send us to the failed label
movq %[to_write], (%[target])
1: movq $0, %[commit_instr]
In contrast to Paul's scheme, this has two additional (highly
predictable) branches and requires generation of a seqcount in
userspace. In its favor, though, it doesnt need preemption hooks,
it's inherently debuggable, and it allows multiple independent
rseq-protected things to coexist without forcing each other to abort.
The first cmpq might not be necessary. I put it there to ensure that
"some fresh value" is visible on the chosen CPU before any other loads
happen, but it's possible that the other cpu double-check is enough.
(Off topic: some of those movq instructions should probably be
rip-relative leaq instead.)
If my descriptor approach were used, we'd save the write to rcx.
Instead we'd do:
.pushsection .rodata
2:
.quad %l[failed] - . /* or similar */
.quad 1b - .
.popsection
leaq 2b(%rip), %[tlabi_rseq_desc]
That's the optimization I was talking about above, although I did a
bad job of describing it.
--Andy
On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> More concretely, this looks like (using totally arbitrary register
> assingments -- probably far from ideal, especially given how GCC's
> constraints work):
>
> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13
> movq {some fresh value}, (%%r13)
> cmpq %[cpu], %%r12
> jne 1b
>
> ... do whatever setup or computation is needed...
>
> movq $%l[failed], %%rcx
> movq $1f, %[commit_instr]
> cmpq {whatever counter we chose}, (%%r13)
> jne %l[failed]
> cmpq %[cpu], %%r12
> jne %l[failed]
>
> <-- a signal in here that conflicts with us would clobber (%%r13), and
> the kernel would notice and send us to the failed label
>
> movq %[to_write], (%[target])
> 1: movq $0, %[commit_instr]
And the kernel, for every thread that has had the syscall called and a
thingy registered, needs to (at preempt/signal-setup):
if (get_user(post_commit_ip, current->post_commit_ip))
return -EFAULT;
if (likely(!post_commit_ip))
return 0;
if (regs->ip >= post_commit_ip)
return 0;
if (get_user(seq, (u32 __user *)regs->r13))
return -EFAULT;
if (regs->$(which one holds our chosen seq?) == seq) {
/* nothing changed, do not cancel, proceed to commit. */
return 0;
}
if (put_user(0UL, current->post_commit_ip))
return -EFAULT;
regs->ip = regs->rcx;
> In contrast to Paul's scheme, this has two additional (highly
> predictable) branches and requires generation of a seqcount in
> userspace. In its favor, though, it doesnt need preemption hooks,
Without preemption hooks, how would one thread preempting another at the
above <-- clobber anything and cause the commit to fail?
> it's inherently debuggable,
It is more debuggable, agreed.
> and it allows multiple independent
> rseq-protected things to coexist without forcing each other to abort.
And the kernel only needs to load the second cacheline if it lands in
the middle of a finish block, which should be manageable overhead I
suppose.
But the userspace chunk is lots slower as it needs to always touch
multiple lines, since the @cpu, @seq and @post_commit_ip all live in
separate lines (although I suppose @cpu and @post_commit_ip could live
in the same).
The finish thing needs 3 registers for:
- fail ip
- seq pointer
- seq value
Which I suppose is possible even on register constrained architectures
like i386.
On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> More concretely, this looks like (using totally arbitrary register
>> assingments -- probably far from ideal, especially given how GCC's
>> constraints work):
>>
>> enter the critical section:
>> 1:
>> movq %[cpu], %%r12
>> movq {address of counter for our cpu}, %%r13
>> movq {some fresh value}, (%%r13)
>> cmpq %[cpu], %%r12
>> jne 1b
>>
>> ... do whatever setup or computation is needed...
>>
>> movq $%l[failed], %%rcx
>> movq $1f, %[commit_instr]
>> cmpq {whatever counter we chose}, (%%r13)
>> jne %l[failed]
>> cmpq %[cpu], %%r12
>> jne %l[failed]
>>
>> <-- a signal in here that conflicts with us would clobber (%%r13), and
>> the kernel would notice and send us to the failed label
>>
>> movq %[to_write], (%[target])
>> 1: movq $0, %[commit_instr]
>
> And the kernel, for every thread that has had the syscall called and a
> thingy registered, needs to (at preempt/signal-setup):
>
> if (get_user(post_commit_ip, current->post_commit_ip))
> return -EFAULT;
>
> if (likely(!post_commit_ip))
> return 0;
>
> if (regs->ip >= post_commit_ip)
> return 0;
>
> if (get_user(seq, (u32 __user *)regs->r13))
> return -EFAULT;
>
> if (regs->$(which one holds our chosen seq?) == seq) {
> /* nothing changed, do not cancel, proceed to commit. */
> return 0;
Only return zero if regs->${which one holds the cpu) == smp_processor_id().
> }
>
> if (put_user(0UL, current->post_commit_ip))
> return -EFAULT;
>
> regs->ip = regs->rcx;
I was imagining this happening at (return to userspace or preempt) and
possibly at signal return, but yes, more or less.
>
>
>> In contrast to Paul's scheme, this has two additional (highly
>> predictable) branches and requires generation of a seqcount in
>> userspace. In its favor, though, it doesnt need preemption hooks,
>
> Without preemption hooks, how would one thread preempting another at the
> above <-- clobber anything and cause the commit to fail?
It doesn't, which is what I like about my variant. If the thread
accesses the protected data structure, though, it should bump the
sequence count, which will cause the first thread to about when it
gets scheduled in.
>
>> it's inherently debuggable,
>
> It is more debuggable, agreed.
>
>> and it allows multiple independent
>> rseq-protected things to coexist without forcing each other to abort.
>
> And the kernel only needs to load the second cacheline if it lands in
> the middle of a finish block, which should be manageable overhead I
> suppose.
>
> But the userspace chunk is lots slower as it needs to always touch
> multiple lines, since the @cpu, @seq and @post_commit_ip all live in
> separate lines (although I suppose @cpu and @post_commit_ip could live
> in the same).
>
> The finish thing needs 3 registers for:
>
> - fail ip
> - seq pointer
> - seq value
>
> Which I suppose is possible even on register constrained architectures
> like i386.
I think this can all be munged into two cachelines:
One cacheline contains the per-thread CPU number and post_commit_ip
(either by doing it over Linus' dead body or by having userspace
allocate it carefully). The other contains the sequence counter *and*
the percpu data structure that's protected. So in some sense it's the
same number of cache lines as Paul's version.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski [email protected] wrote:
> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
[...]
>>
>>> it's inherently debuggable,
>>
>> It is more debuggable, agreed.
>>
>>> and it allows multiple independent
>>> rseq-protected things to coexist without forcing each other to abort.
[...]
My understanding is that the main goal of this rather more complex
proposal is to make interaction with debuggers more straightforward in
cases of single-stepping through the rseq critical section.
I recently came up with a scheme that should allow us to handle such
situations in a fashion similar to debuggers handling ll/sc
restartable sequences of instructions on e.g. powerpc. The good news
is that my scheme does not require anything at the kernel level.
The idea is simple: the userspace rseq critical sections now
become marked by 3 inline functions (rather than 2 in Paul's proposal):
rseq_start(void *rseq_key)
rseq_finish(void *rseq_key)
rseq_abort(void *rseq_key)
We associate each critical section with a unique "key" (dummy
1 byte object in the process address space), so we can group
them. The new "rseq_abort" would mark exit points that would
exit the critical section without executing the final commit
instruction.
Within each of rseq_start, rseq_finish and rseq_abort,
we declare a non-loadable section that gets populated
with the following tuples:
(RSEQ_TYPE, insn address, rseq_key)
Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
That special section would be found in the executable by the
debugger, which can then skip over entire restartable critical
sections when it encounters them by placing breakpoints at
all exit points (finish and cancel) associated to the same
rseq_key as the entry point (start).
This way we don't need to complexify the runtime code, neither
at kernel nor user-space level, and we get debuggability using
a trick similar to what ll/sc architectures already need to do.
Of course, this requires extending gdb, which should not be
a show-stopper.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
<[email protected]> wrote:
> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski [email protected] wrote:
>
>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> [...]
>>>
>>>> it's inherently debuggable,
>>>
>>> It is more debuggable, agreed.
>>>
>>>> and it allows multiple independent
>>>> rseq-protected things to coexist without forcing each other to abort.
>
> [...]
>
> My understanding is that the main goal of this rather more complex
> proposal is to make interaction with debuggers more straightforward in
> cases of single-stepping through the rseq critical section.
The things I like about my proposal are both that you can single-step
through it just like any other code as long as you pin the thread to a
CPU and that it doesn't make preemption magical. (Of course, you can
*force* it to do something on resume and/or preemption by sticking a
bogus value in the expected event count field, but that's not the
intended use. Hmm, I guess it does need to hook preemption and/or
resume for all processes that enable the thing so it can know to check
for an enabled post_commit_rip, just like all the other proposals.)
Also, mine lets you have a fairly long-running critical section that
doesn't get aborted under heavy load and can interleave with other
critical sections that don't conflict.
>
> I recently came up with a scheme that should allow us to handle such
> situations in a fashion similar to debuggers handling ll/sc
> restartable sequences of instructions on e.g. powerpc. The good news
> is that my scheme does not require anything at the kernel level.
>
> The idea is simple: the userspace rseq critical sections now
> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>
> rseq_start(void *rseq_key)
> rseq_finish(void *rseq_key)
> rseq_abort(void *rseq_key)
How do you use this thing? What are its semantics?
--Andy
>
> We associate each critical section with a unique "key" (dummy
> 1 byte object in the process address space), so we can group
> them. The new "rseq_abort" would mark exit points that would
> exit the critical section without executing the final commit
> instruction.
>
> Within each of rseq_start, rseq_finish and rseq_abort,
> we declare a non-loadable section that gets populated
> with the following tuples:
>
> (RSEQ_TYPE, insn address, rseq_key)
>
> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>
> That special section would be found in the executable by the
> debugger, which can then skip over entire restartable critical
> sections when it encounters them by placing breakpoints at
> all exit points (finish and cancel) associated to the same
> rseq_key as the entry point (start).
>
> This way we don't need to complexify the runtime code, neither
> at kernel nor user-space level, and we get debuggability using
> a trick similar to what ll/sc architectures already need to do.
>
> Of course, this requires extending gdb, which should not be
> a show-stopper.
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Andy Lutomirski
AMA Capital Management, LLC
----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski [email protected] wrote:
> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
> <[email protected]> wrote:
>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski [email protected] wrote:
>>
>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>> [...]
>>>>
>>>>> it's inherently debuggable,
>>>>
>>>> It is more debuggable, agreed.
>>>>
>>>>> and it allows multiple independent
>>>>> rseq-protected things to coexist without forcing each other to abort.
>>
>> [...]
>>
>> My understanding is that the main goal of this rather more complex
>> proposal is to make interaction with debuggers more straightforward in
>> cases of single-stepping through the rseq critical section.
>
> The things I like about my proposal are both that you can single-step
> through it just like any other code as long as you pin the thread to a
> CPU and that it doesn't make preemption magical. (Of course, you can
> *force* it to do something on resume and/or preemption by sticking a
> bogus value in the expected event count field, but that's not the
> intended use. Hmm, I guess it does need to hook preemption and/or
> resume for all processes that enable the thing so it can know to check
> for an enabled post_commit_rip, just like all the other proposals.)
>
> Also, mine lets you have a fairly long-running critical section that
> doesn't get aborted under heavy load and can interleave with other
> critical sections that don't conflict.
Yes, those would be nice advantages. I'll have to do a few more
pseudo-code and execution scenarios to get a better understanding of
your idea.
>
>>
>> I recently came up with a scheme that should allow us to handle such
>> situations in a fashion similar to debuggers handling ll/sc
>> restartable sequences of instructions on e.g. powerpc. The good news
>> is that my scheme does not require anything at the kernel level.
>>
>> The idea is simple: the userspace rseq critical sections now
>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>>
>> rseq_start(void *rseq_key)
>> rseq_finish(void *rseq_key)
>> rseq_abort(void *rseq_key)
>
> How do you use this thing? What are its semantics?
You define one rseq_key variable (dummy 1 byte variable, can be an
empty structure) for each rseq critical section you have in your
program.
A rseq critical section will typically have one entry point (rseq_start),
and one exit point (rseq_finish). I'm saying "typically" because there
may be more than one entry point, and more than one exit point per
critical section.
Entry and exit points mark the beginning and end of each rseq critical
section. rseq_start loads the sequence counter from the TLS and copies
it onto the stack. It then gets passed to rseq_finish() to be compared
with the final seqnum TLS value just before the commit. rseq_finish is
the one responsible for storing into the post_commit_instr field of the
TLS and populating rcx with the failure insn label address. rseq_finish()
does the commit.
And there is rseq_abort(), which would need to be called if we just want
to exit from a rseq critical section without doing the commit (no matching
call to rseq_finish after a rseq_start).
Each of rseq_start, finish, and abort would need to receive a pointer
to the rseq_key as parameter.
rseq_start would return the sequence number read from the TLS.
rseq_finish would also receive as parameter that sequence number that has
been returned by rseq_start.
Does it make sense ?
Thanks,
Mathieu
>
> --Andy
>
>>
>> We associate each critical section with a unique "key" (dummy
>> 1 byte object in the process address space), so we can group
>> them. The new "rseq_abort" would mark exit points that would
>> exit the critical section without executing the final commit
>> instruction.
>>
>> Within each of rseq_start, rseq_finish and rseq_abort,
>> we declare a non-loadable section that gets populated
>> with the following tuples:
>>
>> (RSEQ_TYPE, insn address, rseq_key)
>>
>> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>>
>> That special section would be found in the executable by the
>> debugger, which can then skip over entire restartable critical
>> sections when it encounters them by placing breakpoints at
>> all exit points (finish and cancel) associated to the same
>> rseq_key as the entry point (start).
>>
>> This way we don't need to complexify the runtime code, neither
>> at kernel nor user-space level, and we get debuggability using
>> a trick similar to what ll/sc architectures already need to do.
>>
>> Of course, this requires extending gdb, which should not be
>> a show-stopper.
>>
>> Thoughts ?
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13
> movq {some fresh value}, (%%r13)
> cmpq %[cpu], %%r12
> jne 1b
This is inherently racy; your forgot the detail of 'some fresh value',
but since you want to avoid collisions you really want an increment.
But load-store archs cannot do that. Or rather, they need to do:
load Rn, $event
add Rn, Rn, 1
store $event, Rn
But if they're preempted in the middle, two threads will collide and
generate the _same_ increment. Comparing CPU numbers will not fix that.
On Thu, Apr 07, 2016 at 03:05:26PM -0700, Andy Lutomirski wrote:
> It doesn't, which is what I like about my variant. If the thread
> accesses the protected data structure, though, it should bump the
> sequence count, which will cause the first thread to about when it
> gets scheduled in.
Nope it won't, because that first thread is right at the commit
instruction, nothing will stop it from executing that store and clobbing
what we just wrote.
On Apr 8, 2016 4:04 AM, "Peter Zijlstra" <[email protected]> wrote:
>
> On Thu, Apr 07, 2016 at 03:05:26PM -0700, Andy Lutomirski wrote:
>
> > It doesn't, which is what I like about my variant. If the thread
> > accesses the protected data structure, though, it should bump the
> > sequence count, which will cause the first thread to about when it
> > gets scheduled in.
>
> Nope it won't, because that first thread is right at the commit
> instruction, nothing will stop it from executing that store and clobbing
> what we just wrote.
>
I don't think so. I write an event number. You commit because you
didn't notice. I haven't loaded yet from the value you wrote when you
committed, so nothing goes wrong.
--Andy
On Apr 7, 2016 11:41 PM, "Peter Zijlstra" <[email protected]> wrote:
>
> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> > enter the critical section:
> > 1:
> > movq %[cpu], %%r12
> > movq {address of counter for our cpu}, %%r13
> > movq {some fresh value}, (%%r13)
> > cmpq %[cpu], %%r12
> > jne 1b
>
> This is inherently racy; your forgot the detail of 'some fresh value',
> but since you want to avoid collisions you really want an increment.
>
> But load-store archs cannot do that. Or rather, they need to do:
>
> load Rn, $event
> add Rn, Rn, 1
> store $event, Rn
>
> But if they're preempted in the middle, two threads will collide and
> generate the _same_ increment. Comparing CPU numbers will not fix that.
Even on x86 this won't work -- we have no actual guarantee we're on
the right CPU, so we'd have to use an atomic.
I was thinking we'd allocate from a per-thread pool (say 24 bits of
thread ID and the rest being a nonce). On load-store architectures
this wouldn't be async-signal-safe, though. Hmm.
--Andy
----- On Apr 7, 2016, at 10:05 PM, Mathieu Desnoyers [email protected] wrote:
> ----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski [email protected] wrote:
>
>> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
>> <[email protected]> wrote:
>>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski [email protected] wrote:
>>>
>>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
>>> [...]
>>>>>
>>>>>> it's inherently debuggable,
>>>>>
>>>>> It is more debuggable, agreed.
>>>>>
>>>>>> and it allows multiple independent
>>>>>> rseq-protected things to coexist without forcing each other to abort.
>>>
>>> [...]
>>>
>>> My understanding is that the main goal of this rather more complex
>>> proposal is to make interaction with debuggers more straightforward in
>>> cases of single-stepping through the rseq critical section.
>>
>> The things I like about my proposal are both that you can single-step
>> through it just like any other code as long as you pin the thread to a
>> CPU and that it doesn't make preemption magical. (Of course, you can
>> *force* it to do something on resume and/or preemption by sticking a
>> bogus value in the expected event count field, but that's not the
>> intended use. Hmm, I guess it does need to hook preemption and/or
>> resume for all processes that enable the thing so it can know to check
>> for an enabled post_commit_rip, just like all the other proposals.)
>>
>> Also, mine lets you have a fairly long-running critical section that
>> doesn't get aborted under heavy load and can interleave with other
>> critical sections that don't conflict.
>
> Yes, those would be nice advantages. I'll have to do a few more
> pseudo-code and execution scenarios to get a better understanding of
> your idea.
>
>>
>>>
>>> I recently came up with a scheme that should allow us to handle such
>>> situations in a fashion similar to debuggers handling ll/sc
>>> restartable sequences of instructions on e.g. powerpc. The good news
>>> is that my scheme does not require anything at the kernel level.
>>>
>>> The idea is simple: the userspace rseq critical sections now
>>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
>>>
>>> rseq_start(void *rseq_key)
>>> rseq_finish(void *rseq_key)
>>> rseq_abort(void *rseq_key)
>>
>> How do you use this thing? What are its semantics?
>
> You define one rseq_key variable (dummy 1 byte variable, can be an
> empty structure) for each rseq critical section you have in your
> program.
>
> A rseq critical section will typically have one entry point (rseq_start),
> and one exit point (rseq_finish). I'm saying "typically" because there
> may be more than one entry point, and more than one exit point per
> critical section.
>
> Entry and exit points mark the beginning and end of each rseq critical
> section. rseq_start loads the sequence counter from the TLS and copies
> it onto the stack. It then gets passed to rseq_finish() to be compared
> with the final seqnum TLS value just before the commit. rseq_finish is
> the one responsible for storing into the post_commit_instr field of the
> TLS and populating rcx with the failure insn label address. rseq_finish()
> does the commit.
>
> And there is rseq_abort(), which would need to be called if we just want
> to exit from a rseq critical section without doing the commit (no matching
> call to rseq_finish after a rseq_start).
>
> Each of rseq_start, finish, and abort would need to receive a pointer
> to the rseq_key as parameter.
>
> rseq_start would return the sequence number read from the TLS.
>
> rseq_finish would also receive as parameter that sequence number that has
> been returned by rseq_start.
>
> Does it make sense ?
By the way, the debugger can always decide to single-step through the
first iteration of the rseq, and then after it loops, decide to skip
single-stepping until the exit points are reached.
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>
>>
>> --Andy
>>
>>>
>>> We associate each critical section with a unique "key" (dummy
>>> 1 byte object in the process address space), so we can group
>>> them. The new "rseq_abort" would mark exit points that would
>>> exit the critical section without executing the final commit
>>> instruction.
>>>
>>> Within each of rseq_start, rseq_finish and rseq_abort,
>>> we declare a non-loadable section that gets populated
>>> with the following tuples:
>>>
>>> (RSEQ_TYPE, insn address, rseq_key)
>>>
>>> Where RSEQ_TYPE is either RSEQ_START, RSEQ_FINISH, or RSEQ_ABORT.
>>>
>>> That special section would be found in the executable by the
>>> debugger, which can then skip over entire restartable critical
>>> sections when it encounters them by placing breakpoints at
>>> all exit points (finish and cancel) associated to the same
>>> rseq_key as the entry point (start).
>>>
>>> This way we don't need to complexify the runtime code, neither
>>> at kernel nor user-space level, and we get debuggability using
>>> a trick similar to what ll/sc architectures already need to do.
>>>
>>> Of course, this requires extending gdb, which should not be
>>> a show-stopper.
>>>
>>> Thoughts ?
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Apr 8, 2016 10:46 AM, "Mathieu Desnoyers"
<[email protected]> wrote:
>
> ----- On Apr 7, 2016, at 10:05 PM, Mathieu Desnoyers [email protected] wrote:
>
> > ----- On Apr 7, 2016, at 9:21 PM, Andy Lutomirski [email protected] wrote:
> >
> >> On Thu, Apr 7, 2016 at 6:11 PM, Mathieu Desnoyers
> >> <[email protected]> wrote:
> >>> ----- On Apr 7, 2016, at 6:05 PM, Andy Lutomirski [email protected] wrote:
> >>>
> >>>> On Thu, Apr 7, 2016 at 1:11 PM, Peter Zijlstra <[email protected]> wrote:
> >>>>> On Thu, Apr 07, 2016 at 09:43:33AM -0700, Andy Lutomirski wrote:
> >>> [...]
> >>>>>
> >>>>>> it's inherently debuggable,
> >>>>>
> >>>>> It is more debuggable, agreed.
> >>>>>
> >>>>>> and it allows multiple independent
> >>>>>> rseq-protected things to coexist without forcing each other to abort.
> >>>
> >>> [...]
> >>>
> >>> My understanding is that the main goal of this rather more complex
> >>> proposal is to make interaction with debuggers more straightforward in
> >>> cases of single-stepping through the rseq critical section.
> >>
> >> The things I like about my proposal are both that you can single-step
> >> through it just like any other code as long as you pin the thread to a
> >> CPU and that it doesn't make preemption magical. (Of course, you can
> >> *force* it to do something on resume and/or preemption by sticking a
> >> bogus value in the expected event count field, but that's not the
> >> intended use. Hmm, I guess it does need to hook preemption and/or
> >> resume for all processes that enable the thing so it can know to check
> >> for an enabled post_commit_rip, just like all the other proposals.)
> >>
> >> Also, mine lets you have a fairly long-running critical section that
> >> doesn't get aborted under heavy load and can interleave with other
> >> critical sections that don't conflict.
> >
> > Yes, those would be nice advantages. I'll have to do a few more
> > pseudo-code and execution scenarios to get a better understanding of
> > your idea.
> >
> >>
> >>>
> >>> I recently came up with a scheme that should allow us to handle such
> >>> situations in a fashion similar to debuggers handling ll/sc
> >>> restartable sequences of instructions on e.g. powerpc. The good news
> >>> is that my scheme does not require anything at the kernel level.
> >>>
> >>> The idea is simple: the userspace rseq critical sections now
> >>> become marked by 3 inline functions (rather than 2 in Paul's proposal):
> >>>
> >>> rseq_start(void *rseq_key)
> >>> rseq_finish(void *rseq_key)
> >>> rseq_abort(void *rseq_key)
> >>
> >> How do you use this thing? What are its semantics?
> >
> > You define one rseq_key variable (dummy 1 byte variable, can be an
> > empty structure) for each rseq critical section you have in your
> > program.
> >
> > A rseq critical section will typically have one entry point (rseq_start),
> > and one exit point (rseq_finish). I'm saying "typically" because there
> > may be more than one entry point, and more than one exit point per
> > critical section.
> >
> > Entry and exit points mark the beginning and end of each rseq critical
> > section. rseq_start loads the sequence counter from the TLS and copies
> > it onto the stack. It then gets passed to rseq_finish() to be compared
> > with the final seqnum TLS value just before the commit. rseq_finish is
> > the one responsible for storing into the post_commit_instr field of the
> > TLS and populating rcx with the failure insn label address. rseq_finish()
> > does the commit.
> >
> > And there is rseq_abort(), which would need to be called if we just want
> > to exit from a rseq critical section without doing the commit (no matching
> > call to rseq_finish after a rseq_start).
> >
> > Each of rseq_start, finish, and abort would need to receive a pointer
> > to the rseq_key as parameter.
> >
> > rseq_start would return the sequence number read from the TLS.
> >
> > rseq_finish would also receive as parameter that sequence number that has
> > been returned by rseq_start.
> >
> > Does it make sense ?
>
> By the way, the debugger can always decide to single-step through the
> first iteration of the rseq, and then after it loops, decide to skip
> single-stepping until the exit points are reached.
True, but you're assuming that someone will actually write that code
and then users will know how to use it. That's something I like about
my version.
Admittedly, LL/SC and TSX have the same problem, but those are
architectural, and it's not really a good excuse to add a new thing
that's awkward to debug.
--Andy
On Fri, Apr 8, 2016 at 10:46 AM, Mathieu Desnoyers
<[email protected]> wrote:
>
> By the way, the debugger can always decide to single-step through the
> first iteration of the rseq, and then after it loops, decide to skip
> single-stepping until the exit points are reached.
A _human_ debugger may decide to do that yes.
But the the debugger _program_ may not be that smart. For example,
let's say that you - as a human - set a few watchpoints. The debugger
will use hardware breakpoints for the first few, but in more complex
cases the debugger will actually say "oops, no more hw breakpoints,
I'll just start single-stepping instead".
The human operator may not even be aware that the debugger has gone
into that slower mode. Normally it's just slower. But you'd want it to
be _only_ slower, not "oops, the program no longer makes any forward
progress at all, because a library that the user didn't even know or
care about - and never sees, because the single-stepping is all
internal = happened to use a code sequence that doesn't work under
single-stepping".
Linus
----- On Apr 8, 2016, at 5:25 PM, Linus Torvalds [email protected] wrote:
> On Fri, Apr 8, 2016 at 10:46 AM, Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> By the way, the debugger can always decide to single-step through the
>> first iteration of the rseq, and then after it loops, decide to skip
>> single-stepping until the exit points are reached.
>
> A _human_ debugger may decide to do that yes.
>
> But the the debugger _program_ may not be that smart. For example,
> let's say that you - as a human - set a few watchpoints. The debugger
> will use hardware breakpoints for the first few, but in more complex
> cases the debugger will actually say "oops, no more hw breakpoints,
> I'll just start single-stepping instead".
>
> The human operator may not even be aware that the debugger has gone
> into that slower mode. Normally it's just slower. But you'd want it to
> be _only_ slower, not "oops, the program no longer makes any forward
> progress at all, because a library that the user didn't even know or
> care about - and never sees, because the single-stepping is all
> internal = happened to use a code sequence that doesn't work under
> single-stepping".
Which is why I'm proposing to extend gdb to support this automatically,
without requiring interaction or knowledge from the user.
The idea is to let gdb detect entry points into those restartable
critical sections as it single-steps through the program. It would
know about all rseq c.s. exit points too, so it can track whether
it has single-stepped over an entire rseq c.s. and thus caused a
restart. At that point, it can put the breakpoint at each exit point
associated with the entry point, thus skipping single-step of the
second iteration of the critical section.
I think this could be achieved by populating a section that contains
information about entry and exit points of those critical sections
in the rseq_{start,finish,abort} functions. Those sections would end
up in the app/lib ELF binary, may not have to be necessarily loaded
into program's memory.
Does it make sense to try it out, or am I missing something obvious ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Apr 7, 2016, at 12:43 PM, Andy Lutomirski [email protected] wrote:
> On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <[email protected]> wrote:
>>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>>> >> What I meant was: rather than shoving individual values into the TLABI
>>> >> thing, shove in a pointer:
>>> >>
>>> >> struct commit_info {
>>> >> u64 post_commit_rip;
>>> >> u32 cpu;
>>> >> u64 *event;
>>> >> // whatever else;
>>> >> };
>>> >>
>>> >> and then put a commit_info* in TLABI.
>>> >>
>>> >> This would save some bytes in the TLABI structure.
>>> >
>>> > But would cost us extra indirections. The whole point was getting this
>>> > stuff at a constant offset from the TLS segment register.
>>>
>>> I don't think the extra indirections would matter much. The kernel
>>> would have to chase the pointer, but only in the very rare case where
>>> it resumes userspace during a commit or on the immediately following
>>> instruction.
>>
>> Its about userspace finding these values, not the kernel.
>
> But userspace never reads post_commit_rip or the abort address AFAIK.
>
>>
>>> At the very least, post_commit_rip and the abort address (which I
>>> forgot about) could both live in a static structure,
>>
>> Paul keeps the abort address in rcx.
>
> So it's probably a wash. Paul's way is to load the abort address in
> rcx (one instruction, no memory access) and to put post_commit_rip in
> TLABI (one store). Mine is to put a pointer to a descriptor into
> TLABI (one store). Should barely matter.
>
>>
>>> and shoving a
>>> pointer to *that* into TLABI space is one store instead of two.
>>
>>> > Ah, so what happens if the signal happens before the commit but after
>>> > the load of the seqcount?
>>> >
>>> > Then, even if the signal motifies the count, we'll not observe.
>>> >
>>>
>>> Where exactly?
>>>
>>> In my scheme, nothing except the kernel ever loads the seqcount. The
>>> user code generates a fresh value, writes it to memory, and then, just
>>> before commit, writes that same value to the TLABI area and then
>>> double-checks that the value it wrote at the beginning is still there.
>>>
>>> If the signal modifies the count, then the user code won't directly
>>> notice, but prepare_exit_to_usermode on the way out of the signal will
>>> notice that the (restored) TLABI state doesn't match the counter that
>>> the signal handler changed and will just to the abort address.
>>
>>
>> OK, you lost me.. commit looks like:
>>
>> + __asm__ __volatile__ goto (
>> + "movq $%l[failed], %%rcx\n"
>> + "movq $1f, %[commit_instr]\n"
>> + "cmpq %[start_value], %[current_value]\n"
>>
>> If we get preempted/signaled here without the preemption/signal entry
>> checking for the post_commit_instr, we'll fail hard.
>
> Not if I'm thinking about this right. Because, in my scheme, we'd be
> storing both the address of the counter [1] and the value that we
> expect it to be (current_value? something's confused in the asm or
> the variable naming, or I'm just not following exactly which variable
> is which) in registers or TLABI, and the kernel would notice that
> *counter != expected_counter when it resumes and would send us to the
> abort label.
>
> Presumably there would be one register hard-coded as the expected
> counter value and another register hard-coded as the address of the
> counter. That would avoid a bunch of stores to TLABI, and the kernel
> could still find them. If the C optimizer did a good job, the
> resulting code would be simple.
>
> [1] In my scheme, there would be no reason for the counter to live in
> TLABI. In contrast, there would be an excellent reason for it to
> *not* live in TLABI -- user code could have more than one set of
> counters, and they wouldn't interfere with each other. This would be
> nice if one of them protects something that uses long enough code in
> the critical sections that it's likely to be preempted.
>
>>
>> + "jnz %l[failed]\n"
>> + "movq %[to_write], (%[target])\n"
>> + "1: movq $0, %[commit_instr]\n"
>> + : /* no outputs */
>> + : [start_value]"d"(start_value.storage),
>> + [current_value]"m"(__rseq_state),
>> + [to_write]"r"(to_write),
>> + [target]"r"(p),
>> + [commit_instr]"m"(__rseq_state.post_commit_instr)
>> + : "rcx", "memory"
>> + : failed
>> + );
>
>
> More concretely, this looks like (using totally arbitrary register
> assingments -- probably far from ideal, especially given how GCC's
> constraints work):
>
I think you have an ABA-style issue here: if you are migrated
to another CPU at (A), and back to the original CPU at (B), your
CPU check is not doing much, and you may be corrupting (%%r13)
of a concurrently running thread.
> enter the critical section:
> 1:
> movq %[cpu], %%r12
> movq {address of counter for our cpu}, %%r13
(A) -> preempted, migrated to remote CPU.
> movq {some fresh value}, (%%r13)
(B) -> migrated back to original CPU.
> cmpq %[cpu], %%r12
> jne 1b
My understanding is that you should only ever _read_ from such
a per-cpu data structure upon entry into the critical section,
not write, because you have no way to know whether you are
migrated to another CPU at that point.
Those per-cpu data ideas are interesting though. I've came up
with the sketch of a scheme that fetches a seqcount from the percpu data,
and keeps a copy in the TLS area. Whenever the kernel preempts
a thread, it increments both the seqcount in the currently used
percpu data _and_ the seqcount in the TLS area. Then checking
if we need to restart the critical section is a matter of checking
that the two counters match, else it means the percpu data counter
has been touched by some other thread.
I'll try to come up with an implementation of this idea, because
there are a few more tricks to it to ensure all migration races
are dealt with.
One thing I doubt I'll be able to handle would be having nested
restartable critical sections (e.g. one c.s. nested in another,
but not through a nested signal handler). Not sure if we really
care about this use-case though.
Thanks,
Mathieu
>
> ... do whatever setup or computation is needed...
>
> movq $%l[failed], %%rcx
> movq $1f, %[commit_instr]
> cmpq {whatever counter we chose}, (%%r13)
> jne %l[failed]
> cmpq %[cpu], %%r12
> jne %l[failed]
>
> <-- a signal in here that conflicts with us would clobber (%%r13), and
> the kernel would notice and send us to the failed label
>
> movq %[to_write], (%[target])
> 1: movq $0, %[commit_instr]
>
> In contrast to Paul's scheme, this has two additional (highly
> predictable) branches and requires generation of a seqcount in
> userspace. In its favor, though, it doesnt need preemption hooks,
> it's inherently debuggable, and it allows multiple independent
> rseq-protected things to coexist without forcing each other to abort.
>
> The first cmpq might not be necessary. I put it there to ensure that
> "some fresh value" is visible on the chosen CPU before any other loads
> happen, but it's possible that the other cpu double-check is enough.
>
>
> (Off topic: some of those movq instructions should probably be
> rip-relative leaq instead.)
>
> If my descriptor approach were used, we'd save the write to rcx.
> Instead we'd do:
>
> .pushsection .rodata
> 2:
> .quad %l[failed] - . /* or similar */
> .quad 1b - .
> .popsection
> leaq 2b(%rip), %[tlabi_rseq_desc]
>
> That's the optimization I was talking about above, although I did a
> bad job of describing it.
>
> --Andy
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com