2006-12-02 21:25:46

by Oleg Nesterov

Subject: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

On top of rcu-add-a-prefetch-in-rcu_do_batch.patch

rcu_do_batch:

	struct rcu_head *next, *list;

	while (list) {
		next = list->next;	<------ [1]
		list->func(list);
		list = next;
	}

We can't trust *list after the list->func() call; that is why we load list->next
beforehand. However, I suspect that in theory this is not enough. Suppose that

- [1] is stalled

- list->func() marks *list as unused in some way

- another CPU re-uses this rcu_head and dirties it

- [1] completes and gets a wrong result

This means we need a barrier in between. mb() looks more suitable, but I think
rmb() should suffice.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 19-rc6/kernel/rcupdate.c~rdp	2006-12-02 20:46:03.000000000 +0300
+++ 19-rc6/kernel/rcupdate.c	2006-12-02 21:04:12.000000000 +0300
@@ -236,6 +236,8 @@ static void rcu_do_batch(struct rcu_data
 	list = rdp->donelist;
 	while (list) {
 		next = list->next;
+		/* complete the load above before we call ->func() */
+		smp_rmb();
 		prefetch(next);
 		list->func(list);
 		list = next;


2006-12-03 17:39:32

by Eric Dumazet

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

Oleg Nesterov wrote:
> On top of rcu-add-a-prefetch-in-rcu_do_batch.patch
>
> rcu_do_batch:
>
> 	struct rcu_head *next, *list;
>
> 	while (list) {
> 		next = list->next;	<------ [1]
> 		list->func(list);
> 		list = next;
> 	}
>
> We can't trust *list after list->func() call, that is why we load list->next
> beforehand. However I suspect in theory this is not enough, suppose that
>
> - [1] is stalled
>
> - list->func() marks *list as unused in some way
>
> - another CPU re-uses this rcu_head and dirties it
>
> - [1] completes and gets a wrong result
>
> This means we need a barrier in between. mb() looks more suitable, but I think
> rmb() should suffice.
>

Well, hopefully the "list->func()" MUST do the right thing [*], so your patch
is not necessary.

For example, most structures are freed with kfree()/kmem_cache_free() and
these functions MUST imply an smp_mb() [if/when exchanging data with other
cpus], or else many uses in the kernel should be corrected as well.


[*]: In particular, the slab code's management does something special when
transferring local objects from local CPU A's store to 'other CPUs B'.
Other mechanisms should also use some kind of memory barrier in order to
transfer an object to another CPU, or you could imagine in-flight stores
from CPU A overwriting an object that was 'given' to CPU B.

2006-12-03 20:02:44

by Oleg Nesterov

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

On 12/03, Eric Dumazet wrote:
>
> Oleg Nesterov wrote:
> >On top of rcu-add-a-prefetch-in-rcu_do_batch.patch
> >
> >rcu_do_batch:
> >
> >	struct rcu_head *next, *list;
> >
> >	while (list) {
> >		next = list->next;	<------ [1]
> >		list->func(list);
> >		list = next;
> >	}
> >
> >We can't trust *list after list->func() call, that is why we load
> >list->next
> >beforehand. However I suspect in theory this is not enough, suppose that
> >
> > - [1] is stalled
> >
> > - list->func() marks *list as unused in some way
> >
> > - another CPU re-uses this rcu_head and dirties it
> >
> > - [1] completes and gets a wrong result
> >
> >This means we need a barrier in between. mb() looks more suitable, but I
> >think
> >rmb() should suffice.
> >
>
> Well, hopefully the "list->func()" MUST do the right thing [*], so your
> patch is not necessary.

Yes, I don't claim it is necessary, note the "pure theoretical".

> For example, most structures are freed with kfree()/kmem_cache_free() and
> these functions MUST imply an smp_mb() [if/when exchanging data with other
> cpus], or else many uses in the kernel should be corrected as well.

Yes, mb() is enough (wmb() isn't), and kfree()/kmem_cache_free() are OK.
And I don't know of any example of "unsafe" code in that sense.

However, I believe it is easy to write code which is correct from the
RCU API's point of view, but still unsafe.

Oleg.

2006-12-03 20:44:50

by Eric Dumazet

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

Oleg Nesterov wrote:
> On 12/03, Eric Dumazet wrote:
>> Oleg Nesterov wrote:
>>> On top of rcu-add-a-prefetch-in-rcu_do_batch.patch
>>>
>>> rcu_do_batch:
>>>
>>> 	struct rcu_head *next, *list;
>>>
>>> 	while (list) {
>>> 		next = list->next;	<------ [1]
>>> 		list->func(list);
>>> 		list = next;
>>> 	}
>>>
>>> We can't trust *list after list->func() call, that is why we load
>>> list->next
>>> beforehand. However I suspect in theory this is not enough, suppose that
>>>
>>> - [1] is stalled
>>>
>>> - list->func() marks *list as unused in some way
>>>
>>> - another CPU re-uses this rcu_head and dirties it
>>>
>>> - [1] completes and gets a wrong result
>>>
>>> This means we need a barrier in between. mb() looks more suitable, but I
>>> think
>>> rmb() should suffice.
>>>
>> Well, hopefully the "list->func()" MUST do the right thing [*], so your
>> patch is not necessary.
>
> Yes, I don't claim it is necessary, note the "pure theoretical".
>
>> For example, most structures are freed with kfree()/kmem_cache_free() and
>> these functions MUST imply an smp_mb() [if/when exchanging data with other
>> cpus], or else many uses in the kernel should be corrected as well.
>
> Yes, mb() is enough (wmb() isn't) and kfree()/kmem_cache_free() are ok.
> And I don't know any example of "unsafe" code in that sense.
>
> However I believe it is easy to make the code which is correct from the
> RCU's API pov, but unsafe.

Yes, but how is this related to RCU?
I mean, rcu_do_batch() is just a loop like many others in the kernel.
The loop itself is not buggy, but it can call a buggy function, you are right.
An smp_rmb() won't avoid all possible bugs...

2006-12-03 22:12:43

by Oleg Nesterov

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

On 12/03, Eric Dumazet wrote:
>
> Oleg Nesterov wrote:
>
> Yes, but how is it related to RCU ?
> I mean, rcu_do_batch() is just a loop like others in kernel.
> The loop itself is not buggy, but can call a buggy function, you are right.

	int start_me_again;

	struct rcu_head rcu_head;

	void rcu_func(struct rcu_head *rcu)
	{
		start_me_again = 1;
	}

	// could be called on arbitrary CPU
	void check_start_me_again(void)
	{
		static DEFINE_SPINLOCK(lock);

		spin_lock(&lock);
		if (start_me_again) {
			start_me_again = 0;
			call_rcu(&rcu_head, rcu_func);
		}
		spin_unlock(&lock);
	}

I'd say this code is not buggy.

In case it was not clear: I do not claim we need this patch (I don't know).
And yes, I very much doubt we can hit this problem in practice (even if I am
right).

What I don't agree with is that it is the callback which should take care of
this problem.

> A smp_rmb() wont avoid all possible bugs...

For example?

Oleg.

2006-12-03 23:08:04

by Eric Dumazet

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

Oleg Nesterov wrote:
> On 12/03, Eric Dumazet wrote:
>> Oleg Nesterov wrote:
>>
>> Yes, but how is it related to RCU ?
>> I mean, rcu_do_batch() is just a loop like others in kernel.
>> The loop itself is not buggy, but can call a buggy function, you are right.
>
> 	int start_me_again;
>
> 	struct rcu_head rcu_head;
>
> 	void rcu_func(struct rcu_head *rcu)
> 	{
> 		start_me_again = 1;
> 	}
>
> 	// could be called on arbitrary CPU
> 	void check_start_me_again(void)
> 	{
> 		static DEFINE_SPINLOCK(lock);
>
> 		spin_lock(&lock);
> 		if (start_me_again) {
> 			start_me_again = 0;
> 			call_rcu(&rcu_head, rcu_func);
> 		}
> 		spin_unlock(&lock);
> 	}
>
> I'd say this code is not buggy.

Are you sure? Can you prove it? :)

I do think your rcu_func() misses some sync primitive *after* start_me_again = 1;
you seem to rely on some undocumented side effect.
Adding smp_rmb() before calling rcu_func() won't help.

>
> In case it was not clear. I do not claim we need this patch (I don't know).
> And yes, I very much doubt we can hit this problem in practice (even if I am
> right).
>
> What I don't agree with is that it is callback which should take care of this
> problem.
>
>> A smp_rmb() wont avoid all possible bugs...
>
> For example?

An smp_rmb() won't prevent stores pending on this CPU from being committed to
memory after another CPU takes the object for itself. Those stores could
overwrite stores done by the other CPU as well.

So in theory you could design a buggy callback function even with your patch
applied.

Any function that can transfer an object from CPU A's scope to CPU B's scope
must take care of the memory barrier by itself. The caller *cannot* possibly
do the job, especially if it uses an indirect call. However, in some cases
clever algorithms do the reverse, i.e. put the memory barrier in the callers.

The kernel is full of such constructs:

	for (ptr = head; ptr != NULL; ptr = next) {
		next = ptr->next;
		some_subsys_delete(ptr);
	}

And we don't need to add smp_rmb() before the call to some_subsys_delete(); it
would be a nightmare, and would slow down modern CPUs.

When the callback function doesn't need a memory barrier, why should we force a
generic one in the caller?

AFAIK most kfree() calls don't need a barrier, since slab just queues the
pointer into the CPU's array_cache if there is an available slot.


2006-12-03 23:47:59

by Oleg Nesterov

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

On 12/04, Eric Dumazet wrote:
>
> Oleg Nesterov wrote:
> >
> >	int start_me_again;
> >
> >	struct rcu_head rcu_head;
> >
> >	void rcu_func(struct rcu_head *rcu)
> >	{
> >		start_me_again = 1;
> >	}
> >
> >	// could be called on arbitrary CPU
> >	void check_start_me_again(void)
> >	{
> >		static DEFINE_SPINLOCK(lock);
> >
> >		spin_lock(&lock);
> >		if (start_me_again) {
> >			start_me_again = 0;
> >			call_rcu(&rcu_head, rcu_func);
> >		}
> >		spin_unlock(&lock);
> >	}
> >
> >I'd say this code is not buggy.
>
> Are you sure ? Can you prove it ? :)

Looks like you think differently :)

> I do think your rcu_func() misses some sync primitive, *after*
> start_me_again=1;
> You seem to rely on some undocumented side effect.
> Adding smp_rmb() before calling rcu_func() wont help.

I guess you mean that check_start_me_again() can miss start_me_again != 0?
Yes, of course, it should check the condition from time to time. We can even
do

	start_me_again = 1;
	wake_up(&start_me_again_wq);

, this is still unsafe.

> >>A smp_rmb() wont avoid all possible bugs...
> >
> >For example?
>
> A smp_rmb() wont avoid stores pending on this CPU to be committed to memory
> after another cpu takes the object for itself. Those stores could overwrite
> stores done by the other cpu as well.

Yes. But the RCU core doesn't write to the rcu_head (except in call_rcu). The
callback _owns_ the rcu_head; it should be OK to use it in any way without
fear of breaking RCU. Of course, the callback should take care of its own
locking/ordering.

> So in theory you could design a buggy callback function even after your
> patch applied.

So. Do you claim that rcu_func() above is buggy?

> Any function that can transfer an object from CPU A scope to CPU B scope
> must take care of memory barrier by itself. The caller *cannot* possibly do
> the job, especially if it used an indirect call. However, in some cases it
> is possible some clever algos are doing the reverse, ie doing the memory
> barrier in the callers.
>
> Kernel is full of such constructs :
>
> for (ptr = head; ptr != NULL ; ptr = next) {
> next = ptr->next;
> some_subsys_delete(ptr);
> }
>
> And we dont need to add smp_rmb() before the call to some_subsys_delete(),
> it would be a nightmare, and would slow down modern cpus.

Agreed.

Oleg.

2006-12-04 16:42:35

by Paul E. McKenney

Subject: Re: PATCH? rcu_do_batch: fix a pure theoretical memory ordering race

On Sun, Dec 03, 2006 at 12:25:17AM +0300, Oleg Nesterov wrote:
> On top of rcu-add-a-prefetch-in-rcu_do_batch.patch
>
> rcu_do_batch:
>
> 	struct rcu_head *next, *list;
>
> 	while (list) {
> 		next = list->next;	<------ [1]
> 		list->func(list);
> 		list = next;
> 	}
>
> We can't trust *list after list->func() call, that is why we load list->next
> beforehand. However I suspect in theory this is not enough, suppose that
>
> - [1] is stalled
>
> - list->func() marks *list as unused in some way
>
> - another CPU re-uses this rcu_head and dirties it

The memory allocators are required to do whatever is required to
ensure that the first CPU is no longer accessing the memory before
allocating it to the second CPU.

So we should not need to do this -- and if we did, we would have
serious problems throughout the kernel.

Thanx, Paul

> - [1] completes and gets a wrong result
>
> This means we need a barrier in between. mb() looks more suitable, but I think
> rmb() should suffice.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- 19-rc6/kernel/rcupdate.c~rdp	2006-12-02 20:46:03.000000000 +0300
> +++ 19-rc6/kernel/rcupdate.c	2006-12-02 21:04:12.000000000 +0300
> @@ -236,6 +236,8 @@ static void rcu_do_batch(struct rcu_data
>  	list = rdp->donelist;
>  	while (list) {
>  		next = list->next;
> +		/* complete the load above before we call ->func() */
> +		smp_rmb();
>  		prefetch(next);
>  		list->func(list);
>  		list = next;
>