2021-04-13 22:59:02

by Arjun Roy

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Tue, Apr 13, 2021 at 11:22 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
> <[email protected]> wrote:
> >
>
> > As long as the ifdefs are localized within clearly identified wrappers in the
> > rseq code I don't mind doing the special-casing there.
> >
> > The point which remains is that I don't think we want to optimize for speed
> > on 32-bit architectures when it adds special-casing and complexity to the 32-bit
> > build. I suspect there is less and less testing performed on 32-bit architectures
> > nowadays, and it's good that as much code as possible is shared between 32-bit and
> > 64-bit builds to share the test coverage.
> >
>
> Quite frankly V1 was fine, I can't really make it look better.
>
> > Thanks,
> >
> > Mathieu
> >
> > >
> > >
> > >>
> > >> Thanks,
> > >>
> > >> Mathieu
> > >>

If we're special-casing 64-bit architectures anyways - unrolling the
32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
savings on x86-64 when I measured it (well, in a microbenchmark, not
in rseq_get_rseq_cs() directly). Perhaps that could be an additional
avenue for improvement here.

-Arjun

> > >> >
> > >> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > >> > index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
> > >> > --- a/kernel/rseq.c
> > >> > +++ b/kernel/rseq.c
> > >> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> > >> > {
> > >> > u32 ptr;
> > >> >
> > >> > + if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> > >> > + return -EFAULT;
> > >> > + if (ptr)
> > >> > + return -EINVAL;
> > >> > if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> > >> > return -EFAULT;
> > >> > *uptrp = (struct rseq_cs __user *)ptr;
> > >> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> > >> > u32 sig;
> > >> > int ret;
> > >> >
> > >> > - if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> > >> > - return -EFAULT;
> > >> > + ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> > >> > + if (ret)
> > >> > + return ret;
> > >> > if (!urseq_cs) {
> > >> > memset(rseq_cs, 0, sizeof(*rseq_cs));
> > >> > return 0;
> > >> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> > >> > #ifdef CONFIG_64BIT
> > >> > return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> > >> > #else
> > >> > - return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> > >> > + return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> > >> > + put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> > >> > #endif
> > >> > }
> > >>
> > >> --
> > >> Mathieu Desnoyers
> > >> EfficiOS Inc.
> > >> http://www.efficios.com
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com


2021-04-14 06:57:03

by David Laight

Subject: RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

> If we're special-casing 64-bit architectures anyways - unrolling the
> 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> savings on x86-64 when I measured it (well, in a microbenchmark, not
> in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> avenue for improvement here.

The killer is usually 'user copy hardening'.
It significantly slows down sendmsg() and recvmsg().
I've got measurable performance improvements by
using __copy_from_user() when the buffer size has
already been checked - but isn't a compile-time constant.

There is also scope for using __get_user() when reading
iovec[] (instead of copy_from_user()) and doing all the
bounds checks (etc) in the loop.
That gives a measurable improvement for writev("/dev/null").
I must sort those patches out again.
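
From memory, the shape was roughly this (untested sketch; the function
name is illustrative, the MAX_RW_COUNT bound is just an example check,
and it assumes access_ok() has already covered the whole uiov[] array):

static int read_iovec_sketch(struct iovec *kiov,
			     const struct iovec __user *uiov,
			     unsigned long nr_segs)
{
	size_t total = 0;
	unsigned long i;

	for (i = 0; i < nr_segs; i++) {
		/* per-entry reads instead of one big copy_from_user() */
		if (__get_user(kiov[i].iov_base, &uiov[i].iov_base) ||
		    __get_user(kiov[i].iov_len, &uiov[i].iov_len))
			return -EFAULT;
		/* the bounds check rides along in the same loop */
		if (kiov[i].iov_len > MAX_RW_COUNT - total)
			return -EINVAL;
		total += kiov[i].iov_len;
	}
	return 0;
}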

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-04-14 08:43:39

by Arjun Roy

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Tue, Apr 13, 2021 at 2:19 PM David Laight <[email protected]> wrote:
>
> > If we're special-casing 64-bit architectures anyways - unrolling the
> > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > avenue for improvement here.
>
> The killer is usually 'user copy hardening'.
> It significantly slows down sendmsg() and recvmsg().
> I've got measurable performance improvements by
> using __copy_from_user() when the buffer size has
> already been checked - but isn't a compile-time constant.
>
> There is also scope for using __get_user() when reading
> iovec[] (instead of copy_from_user()) and doing all the
> bounds checks (etc) in the loop.
> That gives a measurable improvement for writev("/dev/null").
> I must sort those patches out again.
>
> David
>

In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
sizeof(*rseq_cs)) with 4 unsafe_get_user() calls (4 x 8B = 32B total),
wrapped in user_read_access_begin/end(), which I think would just
bypass user copy hardening (as far as I can tell).
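
Concretely, something like the below (untested sketch; it treats the
32B struct rseq_cs as four u64 words, so it assumes that exact layout,
a 64-bit arch, and an efault label for the fault path):

	u64 *dst = (u64 *)rseq_cs;
	const u64 __user *src = (const u64 __user *)urseq_cs;

	if (!user_read_access_begin(urseq_cs, sizeof(*urseq_cs)))
		return -EFAULT;
	unsafe_get_user(dst[0], &src[0], efault);	/* version + flags */
	unsafe_get_user(dst[1], &src[1], efault);	/* start_ip */
	unsafe_get_user(dst[2], &src[2], efault);	/* post_commit_offset */
	unsafe_get_user(dst[3], &src[3], efault);	/* abort_ip */
	user_read_access_end();
	return 0;
efault:
	user_read_access_end();
	return -EFAULT;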

-Arjun

> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2021-04-14 14:36:37

by David Laight

Subject: RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

From: Arjun Roy
> Sent: 13 April 2021 23:04
>
> On Tue, Apr 13, 2021 at 2:19 PM David Laight <[email protected]> wrote:
> >
> > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > avenue for improvement here.
> >
> > The killer is usually 'user copy hardening'.
> > It significantly slows down sendmsg() and recvmsg().
> > I've got measurable performance improvements by
> > using __copy_from_user() when the buffer size has
> > already been checked - but isn't a compile-time constant.
> >
> > There is also scope for using __get_user() when reading
> > iovec[] (instead of copy_from_user()) and doing all the
> > bounds checks (etc) in the loop.
> > That gives a measurable improvement for writev("/dev/null").
> > I must sort those patches out again.
> >
> > David
> >
>
> In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> sizeof(*rseq_cs)) with 4 (x8B=32 total) unsafe_get_user() (wrapped in
> user_read_access_begin/end()) which I think would just bypass user
> copy hardening (as far as I can tell).

Yes that is one advantage over any of the get_user() calls.
You also lose all the 'how shall we optimise this' checks
in copy_from_user().

Repeated unsafe_get_user() calls are crying out for an optimisation.
You get something like:
failed = 0;
copy();
if (failed) goto error;
copy();
if (failed) goto error;
Where 'failed' is set by the fault handler.

This could be optimised to:
failed = 0;
copy();
copy();
if (failed) goto error;
Even if it faults on every invalid address it probably
doesn't matter - no one cares about that path.

I've not really looked at how it could be achieved though.

It might be that the 'asm goto with outputs' variant
manages to avoid the compare and jump.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-04-14 17:50:24

by Eric Dumazet

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 9:55 AM David Laight <[email protected]> wrote:
>
> From: Arjun Roy
> > Sent: 13 April 2021 23:04
> >
> > On Tue, Apr 13, 2021 at 2:19 PM David Laight <[email protected]> wrote:
> > >
> > > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > > avenue for improvement here.
> > >
> > > The killer is usually 'user copy hardening'.
> > > It significantly slows down sendmsg() and recvmsg().
> > > I've got measurable performance improvements by
> > > using __copy_from_user() when the buffer size has
> > > already been checked - but isn't a compile-time constant.
> > >
> > > There is also scope for using __get_user() when reading
> > > iovec[] (instead of copy_from_user()) and doing all the
> > > bounds checks (etc) in the loop.
> > > That gives a measurable improvement for writev("/dev/null").
> > > I must sort those patches out again.
> > >
> > > David
> > >
> >
> > In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> > sizeof(*rseq_cs)) with 4 (x8B=32 total) unsafe_get_user() (wrapped in
> > user_read_access_begin/end()) which I think would just bypass user
> > copy hardening (as far as I can tell).
>
> Yes that is one advantage over any of the get_user() calls.
> You also lose all the 'how shall we optimise this' checks
> in copy_from_user().
>
> Repeated unsafe_get_user() calls are crying out for an optimisation.
> You get something like:
> failed = 0;
> copy();
> if (failed) goto error;
> copy();
> if (failed) goto error;
> Where 'failed' is set by the fault handler.
>
> This could be optimised to:
> failed = 0;
> copy();
> copy();
> if (failed) goto error;
> Even if it faults on every invalid address it probably
> doesn't matter - no one cares about that path.


Which arch are you looking at?

On x86_64 at least, code generation is just perfect.
Not even a conditional jmp; it is all handled by exceptions (if any).

stac
copy();
copy();
clac


<out_of_line>
efault_end: do error recovery.



>
> I've not really looked at how it could be achieved though.
>
> It might be that the 'asm goto with outputs' variant
> manages to avoid the compare and jump.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2021-04-14 17:56:58

by David Laight

Subject: RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

From: Eric Dumazet
> Sent: 14 April 2021 17:00
...
> > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > You get something like:
> > failed = 0;
> > copy();
> > if (failed) goto error;
> > copy();
> > if (failed) goto error;
> > Where 'failed' is set by the fault handler.
> >
> > This could be optimised to:
> > failed = 0;
> > copy();
> > copy();
> > if (failed) goto error;
> > Even if it faults on every invalid address it probably
> > doesn't matter - no one cares about that path.
>
>
> On which arch are you looking at ?
>
> On x86_64 at least, code generation is just perfect.
> Not even a conditional jmp, it is all handled by exceptions (if any)
>
> stac
> copy();
> copy();
> clac
>
>
> <out_of_line>
> efault_end: do error recovery.

It will be x86_64.
I'm definitely seeing repeated tests of (IIRC) %rdx.

It may well be because the compiler isn't very new.
It will be an Ubuntu build of gcc 9.3.0.
Does that support 'asm goto with outputs'? That
may be the difference.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-04-15 00:34:53

by Eric Dumazet

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
>
> From: Eric Dumazet
> > Sent: 14 April 2021 17:00
> ...
> > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > You get something like:
> > > failed = 0;
> > > copy();
> > > if (failed) goto error;
> > > copy();
> > > if (failed) goto error;
> > > Where 'failed' is set by the fault handler.
> > >
> > > This could be optimised to:
> > > failed = 0;
> > > copy();
> > > copy();
> > > if (failed) goto error;
> > > Even if it faults on every invalid address it probably
> > > doesn't matter - no one cares about that path.
> >
> >
> > On which arch are you looking at ?
> >
> > On x86_64 at least, code generation is just perfect.
> > Not even a conditional jmp, it is all handled by exceptions (if any)
> >
> > stac
> > copy();
> > copy();
> > clac
> >
> >
> > <out_of_line>
> > efault_end: do error recovery.
>
> It will be x86_64.
> I'm definitely seeing repeated tests of (IIRC) %rdx.
>
> It may well be because the compiler isn't very new.
> Will be an Ubuntu build of 9.3.0.
> Does that support 'asm goto with outputs' - which
> may be the difference.
>

Yep, probably. I am using some recent clang version.

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2021-04-15 00:38:11

by Arjun Roy

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
> >
> > From: Eric Dumazet
> > > Sent: 14 April 2021 17:00
> > ...
> > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > You get something like:
> > > > failed = 0;
> > > > copy();
> > > > if (failed) goto error;
> > > > copy();
> > > > if (failed) goto error;
> > > > Where 'failed' is set by the fault handler.
> > > >
> > > > This could be optimised to:
> > > > failed = 0;
> > > > copy();
> > > > copy();
> > > > if (failed) goto error;
> > > > Even if it faults on every invalid address it probably
> > > > doesn't matter - no one cares about that path.
> > >
> > >
> > > On which arch are you looking at ?
> > >
> > > On x86_64 at least, code generation is just perfect.
> > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > >
> > > stac
> > > copy();
> > > copy();
> > > clac
> > >
> > >
> > > <out_of_line>
> > > efault_end: do error recovery.
> >
> > It will be x86_64.
> > I'm definitely seeing repeated tests of (IIRC) %rdx.
> >
> > It may well be because the compiler isn't very new.
> > Will be an Ubuntu build of 9.3.0.
> > Does that support 'asm goto with outputs' - which
> > may be the difference.
> >
>
> Yep, probably. I am using some recent clang version.
>

On x86-64 I can confirm: for me it (4 x unsafe_get_user()) compiles
down to stac + lfence + 8 x mov + clac as straight-line code, and
results in roughly a 5-10% speedup over copy_from_user().

-Arjun


> > David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)

2021-04-15 00:38:50

by Eric Dumazet

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy <[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
> > >
> > > From: Eric Dumazet
> > > > Sent: 14 April 2021 17:00
> > > ...
> > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > You get something like:
> > > > > failed = 0;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Where 'failed' is set by the fault handler.
> > > > >
> > > > > This could be optimised to:
> > > > > failed = 0;
> > > > > copy();
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Even if it faults on every invalid address it probably
> > > > > doesn't matter - no one cares about that path.
> > > >
> > > >
> > > > On which arch are you looking at ?
> > > >
> > > > On x86_64 at least, code generation is just perfect.
> > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > >
> > > > stac
> > > > copy();
> > > > copy();
> > > > clac
> > > >
> > > >
> > > > <out_of_line>
> > > > efault_end: do error recovery.
> > >
> > > It will be x86_64.
> > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > >
> > > It may well be because the compiler isn't very new.
> > > Will be an Ubuntu build of 9.3.0.
> > > Does that support 'asm goto with outputs' - which
> > > may be the difference.
> > >
> >
> > Yep, probably. I am using some recent clang version.
> >
>
> On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> down to stac + lfence + 8 x mov + clac as straight line code. And
> results in roughly a 5%-10% speedup over copy_from_user().
>

But rseq_get_rseq_cs() would still need three different copies,
with 3 stac+lfence+clac sequences.

Maybe we need to enclose all __rseq_handle_notify_resume() operations
in a single section.

> -Arjun
>
>
> > > David
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > Registration No: 1397386 (Wales)

2021-04-15 00:45:39

by Arjun Roy

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet <[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy <[email protected]> wrote:
> >
> > On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
> > > >
> > > > From: Eric Dumazet
> > > > > Sent: 14 April 2021 17:00
> > > > ...
> > > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > > You get something like:
> > > > > > failed = 0;
> > > > > > copy();
> > > > > > if (failed) goto error;
> > > > > > copy();
> > > > > > if (failed) goto error;
> > > > > > Where 'failed' is set by the fault handler.
> > > > > >
> > > > > > This could be optimised to:
> > > > > > failed = 0;
> > > > > > copy();
> > > > > > copy();
> > > > > > if (failed) goto error;
> > > > > > Even if it faults on every invalid address it probably
> > > > > > doesn't matter - no one cares about that path.
> > > > >
> > > > >
> > > > > On which arch are you looking at ?
> > > > >
> > > > > On x86_64 at least, code generation is just perfect.
> > > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > > >
> > > > > stac
> > > > > copy();
> > > > > copy();
> > > > > clac
> > > > >
> > > > >
> > > > > <out_of_line>
> > > > > efault_end: do error recovery.
> > > >
> > > > It will be x86_64.
> > > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > > >
> > > > It may well be because the compiler isn't very new.
> > > > Will be an Ubuntu build of 9.3.0.
> > > > Does that support 'asm goto with outputs' - which
> > > > may be the difference.
> > > >
> > >
> > > Yep, probably. I am using some recent clang version.
> > >
> >
> > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > down to stac + lfence + 8 x mov + clac as straight line code. And
> > results in roughly a 5%-10% speedup over copy_from_user().
> >
>
> But rseq_get_rseq_cs() would still need three different copies,
> with 3 stac+lfence+clac sequences.
>
> Maybe we need to enclose all __rseq_handle_notify_resume() operations
> in a single section.
>
>

To provide a bit of further exposition on this point: with 4x
unsafe_get_user(), recall I mentioned a 5-10% improvement. On the
other hand, with 4x normal get_user() I saw something like a 100%
regression (i.e. a doubling of measured sys time).

I assume that's the fault of the multiple stac+clac pairs.

-Arjun

>
> > -Arjun
> >
> >
> > > > David
> > > >
> > > > -
> > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > Registration No: 1397386 (Wales)

2021-04-15 00:47:18

by Eric Dumazet

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 10:15 PM Arjun Roy <[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy <[email protected]> wrote:
> > >
> > > On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
> > > > >
> > > > > From: Eric Dumazet
> > > > > > Sent: 14 April 2021 17:00
> > > > > ...
> > > > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > > > You get something like:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Where 'failed' is set by the fault handler.
> > > > > > >
> > > > > > > This could be optimised to:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Even if it faults on every invalid address it probably
> > > > > > > doesn't matter - no one cares about that path.
> > > > > >
> > > > > >
> > > > > > On which arch are you looking at ?
> > > > > >
> > > > > > On x86_64 at least, code generation is just perfect.
> > > > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > > > >
> > > > > > stac
> > > > > > copy();
> > > > > > copy();
> > > > > > clac
> > > > > >
> > > > > >
> > > > > > <out_of_line>
> > > > > > efault_end: do error recovery.
> > > > >
> > > > > It will be x86_64.
> > > > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > > > >
> > > > > It may well be because the compiler isn't very new.
> > > > > Will be an Ubuntu build of 9.3.0.
> > > > > Does that support 'asm goto with outputs' - which
> > > > > may be the difference.
> > > > >
> > > >
> > > > Yep, probably. I am using some recent clang version.
> > > >
> > >
> > > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > > down to stac + lfence + 8 x mov + clac as straight line code. And
> > > results in roughly a 5%-10% speedup over copy_from_user().
> > >
> >
> > But rseq_get_rseq_cs() would still need three different copies,
> > with 3 stac+lfence+clac sequences.
> >
> > Maybe we need to enclose all __rseq_handle_notify_resume() operations
> > in a single section.
> >
> >
>
> To provide a bit of further exposition on this point, if you do 4x
> unsafe_get_user() recall I mentioned a 5-10% improvement. On the other
> hand, 4x normal get_user() I saw something like a 100% (ie. doubling
> of sys time measured) regression.
>
> I assume that's the fault of multiple stac+clac.


I was suggesting only using unsafe_get_user() and unsafe_put_user(),
and one surrounding stac/clac pair.

Basically what we had (partially) in our old Google kernels, before
commit 8f2817701492 ("rseq: Use get_user/put_user rather than
__get_user/__put_user")
but with all the needed modern stuff.
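
As a rough sketch (untested; the function name is illustrative, and it
reuses the rseq_cs.ptr64 field from this series):

static int rseq_fetch_and_clear_cs_sketch(struct task_struct *t, u64 *csaddr)
{
	if (!user_access_begin(t->rseq, sizeof(*t->rseq)))
		return -EFAULT;
	unsafe_get_user(*csaddr, &t->rseq->rseq_cs.ptr64, efault_end);
	/* clear rseq_cs under the same stac/clac section */
	unsafe_put_user(0ULL, &t->rseq->rseq_cs.ptr64, efault_end);
	user_access_end();
	return 0;

efault_end:
	user_access_end();
	return -EFAULT;
}

That keeps the read and the clear inside a single stac/clac section.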

2021-04-15 00:49:07

by Arjun Roy

Subject: Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

On Wed, Apr 14, 2021 at 1:25 PM Eric Dumazet <[email protected]> wrote:
>
> On Wed, Apr 14, 2021 at 10:15 PM Arjun Roy <[email protected]> wrote:
> >
> > On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 14, 2021 at 6:08 PM David Laight <[email protected]> wrote:
> > > > > >
> > > > > > From: Eric Dumazet
> > > > > > > Sent: 14 April 2021 17:00
> > > > > > ...
> > > > > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > > > > You get something like:
> > > > > > > > failed = 0;
> > > > > > > > copy();
> > > > > > > > if (failed) goto error;
> > > > > > > > copy();
> > > > > > > > if (failed) goto error;
> > > > > > > > Where 'failed' is set by the fault handler.
> > > > > > > >
> > > > > > > > This could be optimised to:
> > > > > > > > failed = 0;
> > > > > > > > copy();
> > > > > > > > copy();
> > > > > > > > if (failed) goto error;
> > > > > > > > Even if it faults on every invalid address it probably
> > > > > > > > doesn't matter - no one cares about that path.
> > > > > > >
> > > > > > >
> > > > > > > On which arch are you looking at ?
> > > > > > >
> > > > > > > On x86_64 at least, code generation is just perfect.
> > > > > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > > > > >
> > > > > > > stac
> > > > > > > copy();
> > > > > > > copy();
> > > > > > > clac
> > > > > > >
> > > > > > >
> > > > > > > <out_of_line>
> > > > > > > efault_end: do error recovery.
> > > > > >
> > > > > > It will be x86_64.
> > > > > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > > > > >
> > > > > > It may well be because the compiler isn't very new.
> > > > > > Will be an Ubuntu build of 9.3.0.
> > > > > > Does that support 'asm goto with outputs' - which
> > > > > > may be the difference.
> > > > > >
> > > > >
> > > > > Yep, probably. I am using some recent clang version.
> > > > >
> > > >
> > > > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > > > down to stac + lfence + 8 x mov + clac as straight line code. And
> > > > results in roughly a 5%-10% speedup over copy_from_user().
> > > >
> > >
> > > But rseq_get_rseq_cs() would still need three different copies,
> > > with 3 stac+lfence+clac sequences.
> > >
> > > Maybe we need to enclose all __rseq_handle_notify_resume() operations
> > > in a single section.
> > >
> > >
> >
> > To provide a bit of further exposition on this point, if you do 4x
> > unsafe_get_user() recall I mentioned a 5-10% improvement. On the other
> > hand, 4x normal get_user() I saw something like a 100% (ie. doubling
> > of sys time measured) regression.
> >
> > I assume that's the fault of multiple stac+clac.
>
>
> I was suggesting only using unsafe_get_user() and unsafe_put_user(),
> and one surrounding stac/clac
>
> Basically what we had (partially) in our old Google kernels, before
> commit 8f2817701492 ("rseq: Use get_user/put_user rather than
> __get_user/__put_user")
> but with all the needed modern stuff.

Yep - in agreement with that approach.

-Arjun