From: Joel Fernandes
Subject: Re: [RFC 0/2] srcu: Remove pre-flip memory barrier
Date: Wed, 21 Dec 2022 00:02:48 -0500
Message-Id: <969CAAB7-5CBE-45F4-AE12-93E51D13F146@joelfernandes.org>
To: Mathieu Desnoyers
Cc: Neeraj Upadhyay, linux-kernel@vger.kernel.org, Josh Triplett, Lai Jiangshan, "Paul E. McKenney", rcu@vger.kernel.org, Steven Rostedt
X-Mailing-List: linux-kernel@vger.kernel.org

> On Dec 20, 2022, at 10:51 PM, Mathieu Desnoyers wrote:
>
> On 2022-12-20 15:55, Joel Fernandes wrote:
>>>> On Dec 20, 2022, at 1:29 PM, Joel Fernandes wrote:
>>>
>>>>> On Dec 20, 2022, at 1:13 PM, Mathieu Desnoyers wrote:
>>>>>
>>>>> On 2022-12-20 13:05, Joel Fernandes wrote:
>>>>> Hi Mathieu,
>>>>>> On Tue, Dec 20, 2022 at 5:00 PM Mathieu Desnoyers wrote:
>>>>>>
>>>>>> On 2022-12-19 20:04, Joel Fernandes wrote:
>>>>>>>> On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes wrote:
>>>>> [...]
>>>>>>>>> On a 64-bit system, where 64-bit counters are used, AFAIU this needs to
>>>>>>>>> be exactly 2^64 read-side critical sections.
>>>>>>>>
>>>>>>>> Yes, but what about 32-bit systems?
>>>>>>
>>>>>> The overflow indeed happens after 2^32 increments, just like seqlock.
>>>>>> The question we need to ask is therefore: if 2^32 is good enough for
>>>>>> seqlock, why isn't it good enough for SRCU?
>>>>> I think Paul said wrap-around does happen with SRCU on 32-bit, but I'll
>>>>> let him talk more about it. If 32-bit is good enough, let us also drop
>>>>> the size of the counters for 64-bit then?
>>>>>>>>> There are other synchronization algorithms such as seqlocks which are
>>>>>>>>> quite happy with much less protection against overflow (using a 32-bit
>>>>>>>>> counter even on 64-bit architectures).
>>>>>>>>
>>>>>>>> The seqlock is an interesting point.
>>>>>>>>
>>>>>>>>> For practical purposes, I suspect this issue is really just theoretical.
>>>>>>>>
>>>>>>>> I have to ask, what is the benefit of avoiding a flip and scanning
>>>>>>>> active readers? Is the issue about grace period delay or performance?
>>>>>>>> If so, it might be worth prototyping that approach and measuring using
>>>>>>>> rcutorture/rcuscale. If there is a significant benefit to the current
>>>>>>>> approach, then IMO it is worth exploring.
>>>>>>
>>>>>> The main benefit I expect is improved performance of the grace period
>>>>>> implementation in common cases where there are few or no readers
>>>>>> present, especially on machines with many CPUs.
>>>>>>
>>>>>> It allows scanning both periods (0/1) for each CPU within the same pass,
>>>>>> therefore loading both periods' unlock counters sitting in the same
>>>>>> cache line at once (improved locality), and then loading both periods'
>>>>>> lock counters, also sitting in the same cache line.
>>>>>>
>>>>>> It also allows skipping the period flip entirely if there are no readers
>>>>>> present, which is an -arguably- tiny performance improvement as well.
>>>>> The issue of counter wrap aside, what if a new reader always shows up
>>>>> in the active index being scanned? Then can you not delay the GP
>>>>> indefinitely? It seems like writer starvation is possible then (sure,
>>>>> it is possible also with preemption after reader-index sampling, but
>>>>> scanning the active index deliberately will make that worse). Seqlock does
>>>>> not have such writer starvation simply because the writer does not care
>>>>> about what the readers are doing.
>>>>
>>>> No, it's not possible for "current index" readers to starve the g.p. with the side-rcu scheme, because the initial pass (sampling both periods) only opportunistically skips flipping the period if there happen to be no readers in both periods.
>>>>
>>>> If there are readers in the "non-current" period, the grace period waits for them.
>>>>
>>>> If there are readers in the "current" period, it flips the period and then waits for them.
>>>
>>> Ok, glad you already do that; this is what I was sort of leaning toward in my previous email as well, that is, doing a hybrid approach. Sorry, I did not know the details of your side-RCU well enough to know you were already doing something like that.
>>>
>>>>
>>>>> That said, the approach of scanning both counters does seem attractive
>>>>> for when there are no readers, for the reasons you mentioned. Maybe a
>>>>> heuristic to count the number of readers might help? If we are not
>>>>> reader-heavy, then scan both. Otherwise, just scan the inactive ones,
>>>>> and also couple that heuristic with the number of CPUs. I am
>>>>> interested in working on such a design with you! Let us do it and
>>>>> prototype/measure. ;-)
>>>>
>>>> Considering that it would add extra complexity, I'm unsure what that extra heuristic would improve over just scanning both periods in the first pass.
>>>
>>> Makes sense. I think you indirectly implement a form of heuristic already by flipping in case scanning both was not fruitful.
>>>
>>>> I'll be happy to work with you on such a design :) I think we can borrow quite a few concepts from side-rcu for this. Please be aware that my time is limited, though, as I'm currently supposed to be on vacation. :)
>>>
>>> Oh, I was more referring to after the holidays. I am also starting vacation soon and am limited in cycles ;-). It is probably better to enjoy the holidays and come back to this after.
>>>
>>> I do want to finish my memory barrier studies of SRCU over the holidays, since I have been deep in the hole with that already.
>>> Back to the post-flip memory barrier here, since I think now even that might not be needed…
>> In my view, the mb between the totaling of unlocks and the totaling of locks serves as the mb that is required to enforce the GP guarantee, which I think is what Mathieu is referring to.
>
> No, AFAIU you also need barriers at the beginning and end of synchronize_srcu to provide those guarantees:

My bad, I got too hung up on the scan code. Indeed, we need additional ordering on the synchronize side.

Anyway, the full memory barriers are already implemented in the synchronize code AFAICS (beginning and end). At least one of those full memory barriers appears directly at the end of __synchronize_srcu(). But I don't want to say something stupid in the middle of the night, so I will take my time to get back on that.

Thanks,

Joel

>
> * There are memory-ordering constraints implied by synchronize_srcu().
>
> Need for a barrier at the end of synchronize_srcu():
>
> * On systems with more than one CPU, when synchronize_srcu() returns,
> * each CPU is guaranteed to have executed a full memory barrier since
> * the end of its last corresponding SRCU read-side critical section
> * whose beginning preceded the call to synchronize_srcu().
>
> Need for a barrier at the beginning of synchronize_srcu():
>
> * In addition,
> * each CPU having an SRCU read-side critical section that extends beyond
> * the return from synchronize_srcu() is guaranteed to have executed a
> * full memory barrier after the beginning of synchronize_srcu() and before
> * the beginning of that SRCU read-side critical section. Note that these
> * guarantees include CPUs that are offline, idle, or executing in user mode,
> * as well as CPUs that are executing in the kernel.
>
> Thanks,
>
> Mathieu
>
>> Neeraj, do you agree?
>> Thanks.
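For readers of the archive, the "mb between the totaling of unlocks and totaling of locks" discussed above can be sketched as a simplified user-space model. The real check is srcu_readers_active_idx_check() in the kernel, which additionally deals with counter wrap and per-CPU accessors; the struct layout, NR_CPUS value, and function name below are illustrative assumptions, not kernel code:

```c
/* Simplified sketch: decide whether period `idx` has drained by
 * summing unlock counts, issuing a full barrier, then summing lock
 * counts. Illustrative only; not srcu_readers_active_idx_check(). */
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

struct toy_srcu_cpu {
	atomic_ulong srcu_lock_count[2];
	atomic_ulong srcu_unlock_count[2];
};

static struct toy_srcu_cpu cpu_data[NR_CPUS];

static bool toy_readers_active_idx_check(int idx)
{
	unsigned long unlocks = 0, locks = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		unlocks += atomic_load(&cpu_data[cpu].srcu_unlock_count[idx]);

	/* Full barrier: any reader whose srcu_read_unlock() was counted
	 * above must have its srcu_read_lock() visible to the sum below. */
	atomic_thread_fence(memory_order_seq_cst);

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		locks += atomic_load(&cpu_data[cpu].srcu_lock_count[idx]);

	return locks == unlocks;
}
```

Roughly speaking, the fence ensures the lock sum cannot miss a reader that the unlock sum already accounted for, so equal sums indicate a drained period.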
>>>
>>> Cheers,
>>>
>>> - Joel
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mathieu
>>>>
>>>> --
>>>> Mathieu Desnoyers
>>>> EfficiOS Inc.
>>>> https://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
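As a closing sketch of the grace-period pass discussed earlier in the thread (sample both periods first, skip the flip entirely when no readers are present at all, flip only when the current period still has readers), here is a toy single-file model. It is deliberately not side-rcu or kernel SRCU; all names and the busy-wait loop are illustrative assumptions:

```c
/* Toy model of the discussed grace-period pass. A real implementation
 * would use per-CPU counters and sleep rather than busy-wait. */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

static atomic_ulong readers[2];		/* active readers per period (toy) */
static atomic_int current_period;	/* 0 or 1 */

static bool period_has_readers(int p)
{
	return atomic_load(&readers[p]) != 0;
}

static void wait_for_period(int p)
{
	while (period_has_readers(p))
		sched_yield();	/* real code would block, not spin */
}

static void toy_synchronize(void)
{
	int cur = atomic_load(&current_period);

	/* First pass samples both periods: no readers anywhere means
	 * the flip can be skipped entirely. */
	if (!period_has_readers(0) && !period_has_readers(1))
		return;

	/* Readers only in the non-current period: just wait for them. */
	wait_for_period(1 - cur);

	/* Readers in the current period: flip, then wait them out. */
	if (period_has_readers(cur)) {
		atomic_store(&current_period, 1 - cur);
		wait_for_period(cur);
	}
}
```

Current-period readers cannot starve the writer here: at most one flip happens per pass, after which newly arriving readers enter the other period.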