2024-04-04 22:51:31

by Paul E. McKenney

[permalink] [raw]
Subject: Finding open-coded workarounds for 1/2-byte cmpxchg()?

Hello, Julia!

I hope that things are going well for you and yours.

TL;DR: Would you or one of your students be interested in looking for
some interesting code patterns involving cmpxchg? If such patterns exist,
we would either need to provide fixes or to drop support for old systems.

If this would be of interest, please read on!

Arnd (CCed) and I are looking for open-coded emulations for one-byte
and two-byte cmpxchg(). Such emulations might be attempting to work
around the fact that not all architectures support those sizes, being
as they are only required to support four-byte cmpxchg() and, if they
are 64-bit architectures, eight-byte cmpxchg().

There is a one-byte emulation in RCU (kernel/rcu/tasks.h), which looks
like this:

------------------------------------------------------------------------

u8 rcu_trc_cmpxchg_need_qs(struct task_struct *t, u8 old, u8 new)
{
union rcu_special ret;
union rcu_special trs_old = READ_ONCE(t->trc_reader_special);
union rcu_special trs_new = trs_old;

if (trs_old.b.need_qs != old)
return trs_old.b.need_qs;
trs_new.b.need_qs = new;
ret.s = cmpxchg(&t->trc_reader_special.s, trs_old.s, trs_new.s);
return ret.b.need_qs;
}

------------------------------------------------------------------------

An additional issue is posed by these, also in kernel/rcu/tasks.h:

------------------------------------------------------------------------

if (trs.b.need_qs == (TRC_NEED_QS_CHECKED | TRC_NEED_QS)) {

return smp_load_acquire(&t->trc_reader_special.b.need_qs);

smp_store_release(&t->trc_reader_special.b.need_qs, v);

------------------------------------------------------------------------

The additional issue is that these statements assume that each CPU
architecture has single-byte load and store instructions, which some of
the older Alpha systems do not. Fortunately for me, Arnd was already
thinking in terms of removing support for these systems.

But there are additional systems that do not support 16-bit loads and
stores. So if there is a 16-bit counterpart to rcu_trc_cmpxchg_need_qs()
on a quantity that is also subject to 16-bit loads or stores, either
that function needs adjustment or a few more ancient systems need to
lose their Linux-kernel support.

Again, is looking for this sort of thing something that you or one of
your students would be interested in?

Thanx, Paul


2024-04-05 23:00:55

by Julia Lawall

[permalink] [raw]
Subject: Re: Finding open-coded workarounds for 1/2-byte cmpxchg()?



On Thu, 4 Apr 2024, Paul E. McKenney wrote:

> Hello, Julia!
>
> I hope that things are going well for you and yours.
>
> TL;DR: Would you or one of your students be interested in looking for
> some interesting code patterns involving cmpxchg? If such patterns exist,
> we would either need to provide fixes or to drop support for old systems.
>
> If this would be of interest, please read on!
>
> Arnd (CCed) and I are looking for open-coded emulations for one-byte
> and two-byte cmpxchg(). Such emulations might be attempting to work
> around the fact that not all architectures support those sizes, being
> as they are only required to support four-byte cmpxchg() and, if they
> are 64-bit architectures, eight-byte cmpxchg().
>
> There is a one-byte emulation in RCU (kernel/rcu/tasks.h), which looks
> like this:
>
> ------------------------------------------------------------------------
>
> u8 rcu_trc_cmpxchg_need_qs(struct task_struct *t, u8 old, u8 new)
> {
> union rcu_special ret;
> union rcu_special trs_old = READ_ONCE(t->trc_reader_special);
> union rcu_special trs_new = trs_old;
>
> if (trs_old.b.need_qs != old)
> return trs_old.b.need_qs;
> trs_new.b.need_qs = new;
> ret.s = cmpxchg(&t->trc_reader_special.s, trs_old.s, trs_new.s);
> return ret.b.need_qs;
> }
>
> ------------------------------------------------------------------------
>
> An additional issue is posed by these, also in kernel/rcu/tasks.h:
>
> ------------------------------------------------------------------------
>
> if (trs.b.need_qs == (TRC_NEED_QS_CHECKED | TRC_NEED_QS)) {
>
> return smp_load_acquire(&t->trc_reader_special.b.need_qs);
>
> smp_store_release(&t->trc_reader_special.b.need_qs, v);
>
> ------------------------------------------------------------------------
>
> The additional issue is that these statements assume that each CPU
> architecture has single-byte load and store instructions, which some of
> the older Alpha systems do not. Fortunately for me, Arnd was already
> thinking in terms of removing support for these systems.
>
> But there are additional systems that do not support 16-bit loads and
> stores. So if there is a 16-bit counterpart to rcu_trc_cmpxchg_need_qs()
> on a quantity that is also subject to 16-bit loads or stores, either
> that function needs adjustment or a few more ancient systems need to
> lose their Linux-kernel support.
>
> Again, is looking for this sort of thing something that you or one of
> your students would be interested in?

Hello,

I tried, but without much success. The following looks a little bit
promising, eg the use of the variable name "want", but it's not clear that
the rest of the context fits the pattern.

diff -u -p /home/julia/linux/net/sunrpc/xprtsock.c
/tmp/nothing/net/sunrpc/xprtsock.c
--- /home/julia/linux/net/sunrpc/xprtsock.c
+++ /tmp/nothing/net/sunrpc/xprtsock.c
@@ -690,12 +690,9 @@ xs_read_stream(struct sock_xprt *transpo
if (ret <= 0)
goto out_err;
transport->recv.offset = ret;
- if (transport->recv.offset != want)
- return transport->recv.offset;

The semantic patch in question was:

@r@
expression olde;
idexpression old;
@@

if (olde != old) { ... return olde; }

@@
expression newe != r.olde;
idexpression nw;
expression r.olde;
idexpression r.old;
@@

*if (olde != old) { ... return olde; }
..
*newe = nw;
..
*return newe;

The semantic patch doesn't include the cmpxchg. I wasn't sure if that
would always be present, or in what form.

julia

2024-04-05 23:18:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Finding open-coded workarounds for 1/2-byte cmpxchg()?

On Sat, Apr 06, 2024 at 01:00:35AM +0200, Julia Lawall wrote:
>
>
> On Thu, 4 Apr 2024, Paul E. McKenney wrote:
>
> > Hello, Julia!
> >
> > I hope that things are going well for you and yours.
> >
> > TL;DR: Would you or one of your students be interested in looking for
> > some interesting code patterns involving cmpxchg? If such patterns exist,
> > we would either need to provide fixes or to drop support for old systems.
> >
> > If this would be of interest, please read on!
> >
> > Arnd (CCed) and I are looking for open-coded emulations for one-byte
> > and two-byte cmpxchg(). Such emulations might be attempting to work
> > around the fact that not all architectures support those sizes, being
> > as they are only required to support four-byte cmpxchg() and, if they
> > are 64-bit architectures, eight-byte cmpxchg().
> >
> > There is a one-byte emulation in RCU (kernel/rcu/tasks.h), which looks
> > like this:
> >
> > ------------------------------------------------------------------------
> >
> > u8 rcu_trc_cmpxchg_need_qs(struct task_struct *t, u8 old, u8 new)
> > {
> > union rcu_special ret;
> > union rcu_special trs_old = READ_ONCE(t->trc_reader_special);
> > union rcu_special trs_new = trs_old;
> >
> > if (trs_old.b.need_qs != old)
> > return trs_old.b.need_qs;
> > trs_new.b.need_qs = new;
> > ret.s = cmpxchg(&t->trc_reader_special.s, trs_old.s, trs_new.s);
> > return ret.b.need_qs;
> > }
> >
> > ------------------------------------------------------------------------
> >
> > An additional issue is posed by these, also in kernel/rcu/tasks.h:
> >
> > ------------------------------------------------------------------------
> >
> > if (trs.b.need_qs == (TRC_NEED_QS_CHECKED | TRC_NEED_QS)) {
> >
> > return smp_load_acquire(&t->trc_reader_special.b.need_qs);
> >
> > smp_store_release(&t->trc_reader_special.b.need_qs, v);
> >
> > ------------------------------------------------------------------------
> >
> > The additional issue is that these statements assume that each CPU
> > architecture has single-byte load and store instructions, which some of
> > the older Alpha systems do not. Fortunately for me, Arnd was already
> > thinking in terms of removing support for these systems.
> >
> > But there are additional systems that do not support 16-bit loads and
> > stores. So if there is a 16-bit counterpart to rcu_trc_cmpxchg_need_qs()
> > on a quantity that is also subject to 16-bit loads or stores, either
> > that function needs adjustment or a few more ancient systems need to
> > lose their Linux-kernel support.
> >
> > Again, is looking for this sort of thing something that you or one of
> > your students would be interested in?
>
> Hello,
>
> I tried, but without much success. The following looks a little bit
> promising, eg the use of the variable name "want", but it's not clear that
> the rest of the context fits the pattern.

Thank you for digging into this!!!

> diff -u -p /home/julia/linux/net/sunrpc/xprtsock.c
> /tmp/nothing/net/sunrpc/xprtsock.c
> --- /home/julia/linux/net/sunrpc/xprtsock.c
> +++ /tmp/nothing/net/sunrpc/xprtsock.c
> @@ -690,12 +690,9 @@ xs_read_stream(struct sock_xprt *transpo
> if (ret <= 0)
> goto out_err;
> transport->recv.offset = ret;
> - if (transport->recv.offset != want)
> - return transport->recv.offset;

Agreed, though you are quite right that ->recv.copied and ->recv.offset
are different lengths. But yes, as you sugggest below, there must be
a cmpxchg() of some type (cmpxchg(), cmpxchg_acquire(), ...) in the mix
somewhere. Also, the cmpxchg() must be applied to a pointer to either
a 32-bit or a 64-bit quantity, but the change must be 16 bits (or 8 bits).

> The semantic patch in question was:
>
> @r@
> expression olde;
> idexpression old;
> @@
>
> if (olde != old) { ... return olde; }
>
> @@
> expression newe != r.olde;
> idexpression nw;
> expression r.olde;
> idexpression r.old;
> @@
>
> *if (olde != old) { ... return olde; }
> ...
> *newe = nw;
> ...
> *return newe;
>
> The semantic patch doesn't include the cmpxchg. I wasn't sure if that
> would always be present, or in what form.

It would be, but I am having trouble characterizing exactly what the
pattern would look like beyond "emulating a 16-bit cmpxchg() using either
a 32-bit cmpxchg() or a 64-bit cmpxchg()". :-(

Thank you again, and something to think more about.

Thanx, Paul

2024-04-09 19:27:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Finding open-coded workarounds for 1/2-byte cmpxchg()?

On Fri, Apr 05, 2024 at 04:18:29PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 06, 2024 at 01:00:35AM +0200, Julia Lawall wrote:
> >
> >
> > On Thu, 4 Apr 2024, Paul E. McKenney wrote:
> >
> > > Hello, Julia!
> > >
> > > I hope that things are going well for you and yours.
> > >
> > > TL;DR: Would you or one of your students be interested in looking for
> > > some interesting code patterns involving cmpxchg? If such patterns exist,
> > > we would either need to provide fixes or to drop support for old systems.
> > >
> > > If this would be of interest, please read on!
> > >
> > > Arnd (CCed) and I are looking for open-coded emulations for one-byte
> > > and two-byte cmpxchg(). Such emulations might be attempting to work
> > > around the fact that not all architectures support those sizes, being
> > > as they are only required to support four-byte cmpxchg() and, if they
> > > are 64-bit architectures, eight-byte cmpxchg().
> > >
> > > There is a one-byte emulation in RCU (kernel/rcu/tasks.h), which looks
> > > like this:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > u8 rcu_trc_cmpxchg_need_qs(struct task_struct *t, u8 old, u8 new)
> > > {
> > > union rcu_special ret;
> > > union rcu_special trs_old = READ_ONCE(t->trc_reader_special);
> > > union rcu_special trs_new = trs_old;
> > >
> > > if (trs_old.b.need_qs != old)
> > > return trs_old.b.need_qs;
> > > trs_new.b.need_qs = new;
> > > ret.s = cmpxchg(&t->trc_reader_special.s, trs_old.s, trs_new.s);
> > > return ret.b.need_qs;
> > > }
> > >
> > > ------------------------------------------------------------------------
> > >
> > > An additional issue is posed by these, also in kernel/rcu/tasks.h:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > if (trs.b.need_qs == (TRC_NEED_QS_CHECKED | TRC_NEED_QS)) {
> > >
> > > return smp_load_acquire(&t->trc_reader_special.b.need_qs);
> > >
> > > smp_store_release(&t->trc_reader_special.b.need_qs, v);
> > >
> > > ------------------------------------------------------------------------
> > >
> > > The additional issue is that these statements assume that each CPU
> > > architecture has single-byte load and store instructions, which some of
> > > the older Alpha systems do not. Fortunately for me, Arnd was already
> > > thinking in terms of removing support for these systems.
> > >
> > > But there are additional systems that do not support 16-bit loads and
> > > stores. So if there is a 16-bit counterpart to rcu_trc_cmpxchg_need_qs()
> > > on a quantity that is also subject to 16-bit loads or stores, either
> > > that function needs adjustment or a few more ancient systems need to
> > > lose their Linux-kernel support.
> > >
> > > Again, is looking for this sort of thing something that you or one of
> > > your students would be interested in?
> >
> > Hello,
> >
> > I tried, but without much success. The following looks a little bit
> > promising, eg the use of the variable name "want", but it's not clear that
> > the rest of the context fits the pattern.
>
> Thank you for digging into this!!!
>
> > diff -u -p /home/julia/linux/net/sunrpc/xprtsock.c
> > /tmp/nothing/net/sunrpc/xprtsock.c
> > --- /home/julia/linux/net/sunrpc/xprtsock.c
> > +++ /tmp/nothing/net/sunrpc/xprtsock.c
> > @@ -690,12 +690,9 @@ xs_read_stream(struct sock_xprt *transpo
> > if (ret <= 0)
> > goto out_err;
> > transport->recv.offset = ret;
> > - if (transport->recv.offset != want)
> > - return transport->recv.offset;
>
> Agreed, though you are quite right that ->recv.copied and ->recv.offset
> are different lengths. But yes, as you sugggest below, there must be
> a cmpxchg() of some type (cmpxchg(), cmpxchg_acquire(), ...) in the mix
> somewhere. Also, the cmpxchg() must be applied to a pointer to either
> a 32-bit or a 64-bit quantity, but the change must be 16 bits (or 8 bits).
>
> > The semantic patch in question was:
> >
> > @r@
> > expression olde;
> > idexpression old;
> > @@
> >
> > if (olde != old) { ... return olde; }
> >
> > @@
> > expression newe != r.olde;
> > idexpression nw;
> > expression r.olde;
> > idexpression r.old;
> > @@
> >
> > *if (olde != old) { ... return olde; }
> > ...
> > *newe = nw;
> > ...
> > *return newe;
> >
> > The semantic patch doesn't include the cmpxchg. I wasn't sure if that
> > would always be present, or in what form.
>
> It would be, but I am having trouble characterizing exactly what the
> pattern would look like beyond "emulating a 16-bit cmpxchg() using either
> a 32-bit cmpxchg() or a 64-bit cmpxchg()". :-(
>
> Thank you again, and something to think more about.

I took the crude approach of looking at all of the cmpxchg*() invocations,
discarding those that were clearly not an issue. Here are the close calls
that I found:

o drivers/misc/genwqe/card_ddcb.c enqueue_ddcb() does work against
a union that has 32-bit, 16-bit, and eight-bit members, but as
far as I can see the ->icrc_16 member is not used. But this
might be an accident waiting to happen. Or maybe this driver
is used only by architectures with a full set of cmpxchg sizes.

The 8-bit ->hsi and ->shi fields are used for debug output, which
should be harmless.

Ditto __genwqe_purge_ddcb() that same file.

o drivers/platform/surface/aggregator/controller.c ssh_seq_next()
does an 8-bit cmpxchg().

o drivers/platform/surface/aggregator/controller.c ssh_rqid_next()
does a 16-bit cmpxchg().

o kernel/locking/qspinlock_paravirt.h pv_wait_node() does
an 8-bit cmpxchg(). As does pv_kick_node() in that same
file. And __pv_queued_spin_unlock(). And, as Arnd noted,
trylock_clear_pending() and pv_hybrid_queued_unfair_trylock()
in that same file do 16-bit cmpxchg_acquire().

o net/rxrpc/io_thread.c rxrpc_input_packet_on_conn() is strange
in that it supplies 16-bit old and new fields to a cmpxchg()
of a 32-bit quantity, but it is quite possible that this would
be a 16-bit quantity if permitted in core code. This is in an
rxrpc_connection structure.

There are no doubt some false negatives omitted from this list, but
there are a few places that use or would like to use 8-bit and 16-bit
cmpxchg*().

Thanx, Paul