2006-11-04 16:49:28

by Falk Hueffner

[permalink] [raw]
Subject: ipc/msg.c "cleanup" breaks fakeroot on Alpha

Hi,

this hunk

commit 5a06a363ef48444186f18095ae1b932dddbbfa89
Author: Ingo Molnar <[email protected]>
Date: Sun Jul 30 03:04:11 2006 -0700

[PATCH] ipc/msg.c: clean up coding style

Clean up ipc/msg.c to conform to Documentation/CodingStyle. (before it was
an inconsistent hodepodge of various coding styles)

Verified that the before/after .o's are identical.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/ipc/msg.c b/ipc/msg.c
index cd92d34..2b4fccf 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -47,12 +47,12 @@
struct msg_receiver {
struct list_head r_list;
struct task_struct *r_tsk;

int r_mode;
long r_msgtype;
long r_maxsize;

- struct msg_msg* volatile r_msg;
+ volatile struct msg_msg *r_msg;
};

breaks fakeroot on Alpha (variously hangs or oopses). Backing it out
of 2.6.19-rc4 fixes the issue. Is it possible that this change (which
clearly does change semantics) was made in error?

--
Falk


2006-11-04 17:30:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha


* Falk Hueffner <[email protected]> wrote:

> - struct msg_msg* volatile r_msg;
> + volatile struct msg_msg *r_msg;
> };
>
> breaks fakeroot on Alpha (variously hangs or oopses). Backing it out
> of 2.6.19-rc4 fixes the issue. Is it possible that this change (which
> clearly does change semantics) was made in error?

correct, it was an error, lets back it out.

Another interesting question is: what in the IPC code relies on the
volatility of this field, and shouldnt it be converted to explicit
barriers instead?

Ingo

2006-11-04 17:41:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha



On Sat, 4 Nov 2006, Ingo Molnar wrote:
>
> * Falk Hueffner <[email protected]> wrote:
>
> > - struct msg_msg* volatile r_msg;
> > + volatile struct msg_msg *r_msg;
> > };
> >
> > breaks fakeroot on Alpha (variously hangs or oopses). Backing it out
> > of 2.6.19-rc4 fixes the issue. Is it possible that this change (which
> > clearly does change semantics) was made in error?
>
> correct, it was an error, lets back it out.
>
> Another interesting question is: what in the IPC code relies on the
> volatility of this field, and shouldnt it be converted to explicit
> barriers instead?

Absolutely. Anything that depends on "volatile" is broken by definition,
since volatile does _not_ imply memory barriers, and while it may
constrict the compiler in certain ways (and thus effectively make it work
on things like x86 where the memory ordering is fairly easy to get to
work), it is not going to do _anything_ on some other architectures,
except just perhaps make a bug harder to trigger.

Falk - do you have a couple of the oopses (the more, the better: race
conditions tend to have subtle oopses, and with more oopses it is easier
to try to figure out the pattern) that you can point people at, so that we
can get an idea of what's going on.

This looks like a classic case of "volatile is a sign of a bug".

Linus

2006-11-04 18:13:04

by Falk Hueffner

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha

Linus Torvalds <[email protected]> writes:

> Falk - do you have a couple of the oopses (the more, the better: race
> conditions tend to have subtle oopses, and with more oopses it is easier
> to try to figure out the pattern) that you can point people at, so that we
> can get an idea of what's going on.

I've got only one, it's perfectly reproducible with e.g. "fakeroot ls":

ksymoops 2.4.11 on alpha 2.6.19-rc4. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.6.19-rc4/ (default)
-m /src/linux-2.6.19-rc4/System.map (specified)

Error (regular_file): read_ksyms stat /proc/ksyms failed
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Nov 4 19:00:23 juist kernel: Unable to handle kernel paging request at virtual address 0000025800000683
Nov 4 19:00:23 juist kernel: faked-sysv(2111): Oops 0
Nov 4 19:00:23 juist kernel: pc = [<fffffc0000322810>] ra = [<fffffc0000322810>] ps = 0007 Not tainted
Using defaults from ksymoops -t elf64-alpha -a alpha
Nov 4 19:00:23 juist kernel: v0 = fffffc0000724c00 t0 = 0000000000000000 t1 = 0000000000000001
Nov 4 19:00:23 juist kernel: t2 = 00000000000003e8 t3 = fffffffffffffff5 t4 = fffffc006f36bde8
Nov 4 19:00:23 juist kernel: t5 = fffffc006f36bde8 t6 = 00000200001c41cc t7 = fffffc006f368000
Nov 4 19:00:23 juist kernel: s0 = 000000000000000f s1 = 0000025800000683 s2 = fffffc006fdf3e10
Nov 4 19:00:23 juist kernel: s3 = 0000000000000000 s4 = fffffc00006ca8e0 s5 = 0000000000000100
Nov 4 19:00:23 juist kernel: s6 = fffffc006f36bd78
Nov 4 19:00:23 juist kernel: a0 = 0000000000000007 a1 = fffffc006f36bda8 a2 = 0000000000000000
Nov 4 19:00:23 juist kernel: a3 = 0000000000000001 a4 = 0000000000000000 a5 = 000002000000cf80
Nov 4 19:00:23 juist kernel: t8 = 00000200001c41bc t9 = 00000200001c41bc t10= 000000007fffffff
Nov 4 19:00:23 juist kernel: t11= 00000200001c54d2 pv = fffffc00003229b0 at = 000000000749d915
Nov 4 19:00:23 juist kernel: gp = fffffc0000720200 sp = fffffc006f36bd78
Nov 4 19:00:23 juist kernel: Trace:
Nov 4 19:00:23 juist kernel: [<fffffc00003f8228>] expunge_all+0x58/0x90
Nov 4 19:00:23 juist kernel: [<fffffc00003f884c>] sys_msgctl+0x1cc/0x730
Nov 4 19:00:23 juist kernel: [<fffffc00003112a4>] entSys+0xa4/0xc0
Nov 4 19:00:23 juist kernel: Code: 47f0040a b59e0020 b75e0000 47f2040c 4921f629 d35fff0a <a44a0000> 47e0040b


>>RA; fffffc0000322810 <try_to_wake_up+40/180>

>>PC; fffffc0000322810 <try_to_wake_up+40/180> <=====

Trace; fffffc00003f8228 <expunge_all+58/90>
Trace; fffffc00003f884c <sys_msgctl+1cc/730>
Trace; fffffc00003112a4 <entSys+a4/c0>

Code; fffffc00003227f8 <try_to_wake_up+28/180>
0000000000000000 <_PC>:
Code; fffffc00003227f8 <try_to_wake_up+28/180>
0: 0a 04 f0 47 mov a0,s1
Code; fffffc00003227fc <try_to_wake_up+2c/180>
4: 20 00 9e b5 stq s3,32(sp)
Code; fffffc0000322800 <try_to_wake_up+30/180>
8: 00 00 5e b7 stq ra,0(sp)
Code; fffffc0000322804 <try_to_wake_up+34/180>
c: 0c 04 f2 47 mov a2,s3
Code; fffffc0000322808 <try_to_wake_up+38/180>
10: 29 f6 21 49 zapnot s0,0xf,s0
Code; fffffc000032280c <try_to_wake_up+3c/180>
14: 0a ff 5f d3 bsr ra,fffffffffffffc40 <_PC+0xfffffffffffffc40>
Code; fffffc0000322810 <try_to_wake_up+40/180> <=====
18: 00 00 4a a4 ldq t1,0(s1) <=====
Code; fffffc0000322814 <try_to_wake_up+44/180>
1c: 0b 04 e0 47 mov v0,s2


1 error issued. Results may not be reliable.


--
Falk

2006-11-05 16:03:08

by Manfred Spraul

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha

Linus Torvalds wrote:

>[ Removed the kernel mailing list ]
>
>
[kernel mailing list added back]

>As far as I can tell, people hold one or the other, but not both, and
>happily do strange things to "r_msg". The code seems to _know_ that it is
>racy, since in addition to the volatile, it does things like:
>
> ...
> msr->r_msg = NULL;
> wake_up_process(msr->r_tsk);
> smp_mb();
> msr->r_msg = ERR_PTR(res);
> ...
>
>and that memory barrier again doesn't really seem to make a whole lot of
>sense.
>
>
>
msr is a msg_receiver structure. The structure is stored on the stack of
msr->r_tsk.
The smp_mb() guarantees that the wake_up_process is complete before
ERR_PTR(res) is stored into msr->r_msg.

>But I don't know. It clearly _tries_ to do some smart locking, I just
>don't see what the rules are.
>
>
>
The codes tries to to a lockless receive:
- the mutex is only required to create/destroy queues.
- normal queue operations are protected by msg_lock(msqid), which is a
spinlock. One spinlock for each queue.
- if a receiving thread doesn't find a message, then it adds a
msg_receiver structure into msq->q_receivers. This linked list is stored
in the message queue structure and protected by msg_lock(msqid).
- if a sending thread finds a msg_receiver structure, then it removes
the structure from the msq->q_receivers linked list, places the message
into msr->r_msg and wakes up the receiver. All operations happen under
msg_lock(msqid)
- the receiver notices that there is a message in it's msr->r_msg
structure and copies it to user space, without acquiring msg_lock(msqid).

ipc/sem.c uses the same idea, I added a longer block with documentation
(around line 150 in ipc/sem.c)

I'm only aware of one tricky race:
- the sender calls wake_up_process().
- as soon as the receiver finds something in r_msg, it can return to
user space. user space can call exit(3). do_exit destroys the task
structure.
- wake_up_process will cause an oops if it's called after do_exit().
This race happened on s390. The solution is this block:

msr->r_msg = NULL;
wake_up_process(msr->r_tsk);
smp_mb();
msr->r_msg = ERR_PTR(res);

Initially, r_msg is ERR_PTR(-EAGAIN). The sender first sets it to NULL
("message pending"), then calls wake_up_process(), then a memory
barrier, then the final value.

Back to the bug report:
"volatile" shouln't be necessary. The critical part is this loop:

msg = (struct msg_msg*)msr_d.r_msg;
while (msg == NULL) {
cpu_relax();
msg = (struct msg_msg *)msr_d.r_msg;
}
And cpu_relax is a barrier().
On i386, removing the "volatile" has no effect, the .o remains identical.
Falk, could you compare the .o files with/without volatile? Are there
any differences?

The oops was caused by try_to_wake_up, called by expunge_all.
I.e.:
- either the msq->q_receivers linked list got corrupted
- or the target thread was destroyed before wake_up_process completed.
Theoretically, both things are impossible:
- msq->q_receivers is protected by msq_lock()
- the target thread task_struct is guaranteed to remain in scope due to
the "msg == NULL" loop.

I'll try to reproduce the oops on i386 - but I don't see a bug right now.

--
Manfred

2006-11-06 05:59:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha

On Sun, Nov 05, 2006 at 05:02:39PM +0100, Manfred Spraul wrote:
> Linus Torvalds wrote:
>
> >[ Removed the kernel mailing list ]
> >
> >
> [kernel mailing list added back]
>
> >As far as I can tell, people hold one or the other, but not both, and
> >happily do strange things to "r_msg". The code seems to _know_ that it is
> >racy, since in addition to the volatile, it does things like:
> >
> > ...
> > msr->r_msg = NULL;
> > wake_up_process(msr->r_tsk);
> > smp_mb();
> > msr->r_msg = ERR_PTR(res);
> > ...
> >
> >and that memory barrier again doesn't really seem to make a whole lot of
> >sense.
> >
> msr is a msg_receiver structure. The structure is stored on the stack of
> msr->r_tsk.
> The smp_mb() guarantees that the wake_up_process is complete before
> ERR_PTR(res) is stored into msr->r_msg.

OK, so matching code in sys_msgrcv() spins on the r_tsk pointer
until it becomes non-NULL. Can't say that I understand why the
first assignment to msr->r_msg cannot simply be ERR_PTR(res), but
I am probably missing something basic. But your explanation below
clears that up, thank you!

I also don't understand why the code in sys_msgrcv() doesn't have
to remap the msqid, similar to the way it is done in sys_semtimedop().

Is there some reference counter that I am failing to see? There is
the ipc_rcu_getref() on the send side, and something similar seems
like it would work on the read side -- though with additional cache
thrashing, so probably cheaper to remap the id.

So, what am I missing here? How does a msgrcv() racing with an rmid()
avoid taking a lock on a message queue that just got freed? (The
ipc_lock_by_ptr() in "Lockless receive, part 3".) My concern is the
following sequence of steps:

o expunge_all() invokes wake_up_process() and sets r_msg.

o sys_msgrcv() is awakened, but for whatever reason does
not actually start executing (e.g., lots of other busy
processes at higher priority).

o expunge_all() returns to freeque(), which runs through the
rest of its processing, finally calling ipc_rcu_putref().

o ipc_rcu_putref() invokes call_rcu() to free the message
queue after a grace period.

o ipc_immediate_free() is invoked at the end of a grace
period, freeing the message queue.

o sys_msgrcv() finally gets a chance to run, and does an
rcu_read_lock() -- but too late!!!

o sys_msgrcv() does ipc_lock_by_ptr() on a message-queue
data structure that has already been freed. Things
degrade rapidly from here...

Or is there some subtlety that prevents this?

If this problem is real, either an ipc_rcu_getref() before the
msg_unlock() before the schedule() and an ipc_rcu_putref() after
the rcu_read_lock() following the schedule(), both in sys_msgrcv()
on the one hand... Or remap the msqid after the rcu_read_lock()
following the schedule on the other.

Thanx, Paul

> >But I don't know. It clearly _tries_ to do some smart locking, I just
> >don't see what the rules are.
> >
> >
> >
> The codes tries to to a lockless receive:
> - the mutex is only required to create/destroy queues.
> - normal queue operations are protected by msg_lock(msqid), which is a
> spinlock. One spinlock for each queue.
> - if a receiving thread doesn't find a message, then it adds a
> msg_receiver structure into msq->q_receivers. This linked list is stored
> in the message queue structure and protected by msg_lock(msqid).
> - if a sending thread finds a msg_receiver structure, then it removes
> the structure from the msq->q_receivers linked list, places the message
> into msr->r_msg and wakes up the receiver. All operations happen under
> msg_lock(msqid)
> - the receiver notices that there is a message in it's msr->r_msg
> structure and copies it to user space, without acquiring msg_lock(msqid).
>
> ipc/sem.c uses the same idea, I added a longer block with documentation
> (around line 150 in ipc/sem.c)
>
> I'm only aware of one tricky race:
> - the sender calls wake_up_process().
> - as soon as the receiver finds something in r_msg, it can return to
> user space. user space can call exit(3). do_exit destroys the task
> structure.
> - wake_up_process will cause an oops if it's called after do_exit().
> This race happened on s390. The solution is this block:
>
> msr->r_msg = NULL;
> wake_up_process(msr->r_tsk);
> smp_mb();
> msr->r_msg = ERR_PTR(res);
>
> Initially, r_msg is ERR_PTR(-EAGAIN). The sender first sets it to NULL
> ("message pending"), then calls wake_up_process(), then a memory
> barrier, then the final value.
>
> Back to the bug report:
> "volatile" shouln't be necessary. The critical part is this loop:
>
> msg = (struct msg_msg*)msr_d.r_msg;
> while (msg == NULL) {
> cpu_relax();
> msg = (struct msg_msg *)msr_d.r_msg;
> }
> And cpu_relax is a barrier().
> On i386, removing the "volatile" has no effect, the .o remains identical.
> Falk, could you compare the .o files with/without volatile? Are there
> any differences?
>
> The oops was caused by try_to_wake_up, called by expunge_all.
> I.e.:
> - either the msq->q_receivers linked list got corrupted
> - or the target thread was destroyed before wake_up_process completed.
> Theoretically, both things are impossible:
> - msq->q_receivers is protected by msq_lock()
> - the target thread task_struct is guaranteed to remain in scope due to
> the "msg == NULL" loop.
>
> I'll try to reproduce the oops on i386 - but I don't see a bug right now.
>
> --
> Manfred

2006-11-06 06:24:09

by Manfred Spraul

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha

Paul E. McKenney wrote:

>I also don't understand why the code in sys_msgrcv() doesn't have
>to remap the msqid, similar to the way it is done in sys_semtimedop().
>
>
>
What do you mean with remap?

>So, what am I missing here? How does a msgrcv() racing with an rmid()
>avoid taking a lock on a message queue that just got freed? (The
>ipc_lock_by_ptr() in "Lockless receive, part 3".) My concern is the
>following sequence of steps:
>
>o expunge_all() invokes wake_up_process() and sets r_msg.
>
>o sys_msgrcv() is awakened, but for whatever reason does
> not actually start executing (e.g., lots of other busy
> processes at higher priority).
>
>o expunge_all() returns to freeque(), which runs through the
> rest of its processing, finally calling ipc_rcu_putref().
>
>o ipc_rcu_putref() invokes call_rcu() to free the message
> queue after a grace period.
>
>o ipc_immediate_free() is invoked at the end of a grace
> period, freeing the message queue.
>
>o sys_msgrcv() finally gets a chance to run, and does an
> rcu_read_lock() -- but too late!!!
>
>
>
Not too late:
sys_msgrcv() checks msr_d.r_msr, notices that the value is -EIDRM and
returns to user space with -EIDRM immediately. This codepath
doesn't touch the message queue pointer, thus it doesn't matter that the
message queue is already freed.
The code only touches the message queue pointer if msr_d.r_msr
is -EAGAIN - and the rcu_read_lock() guarantees there is no rcu grace
period between the test for -EAGAIN and the ipc_lock_by_ptr.
Thus this should be safe.

But back to the oops:
The oops happens in expunge_all, called from sys_msgctl.
Thus it must be an msgctl(IPC_SET).
IPC_SET is special: it calls expunge_all(-EAGAIN): that's necessary
because IPC_SET can change the permissions.
Unfortunately, faked doesn't use IPC_SET at all :-(

Falk - could you strace your "fakeroot ls" test? Are there any IPC_SET
calls?
Which gcc version do you use? Is it possible that gcc auto-inlined
something?

--
Manfred

2006-11-06 16:17:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha

On Mon, Nov 06, 2006 at 07:23:38AM +0100, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>
> >I also don't understand why the code in sys_msgrcv() doesn't have
> >to remap the msqid, similar to the way it is done in sys_semtimedop().
> >
> What do you mean with remap?

What ipc_lock() does, more or less.

> >So, what am I missing here? How does a msgrcv() racing with an rmid()
> >avoid taking a lock on a message queue that just got freed? (The
> >ipc_lock_by_ptr() in "Lockless receive, part 3".) My concern is the
> >following sequence of steps:
> >
> >o expunge_all() invokes wake_up_process() and sets r_msg.
> >
> >o sys_msgrcv() is awakened, but for whatever reason does
> > not actually start executing (e.g., lots of other busy
> > processes at higher priority).
> >
> >o expunge_all() returns to freeque(), which runs through the
> > rest of its processing, finally calling ipc_rcu_putref().
> >
> >o ipc_rcu_putref() invokes call_rcu() to free the message
> > queue after a grace period.
> >
> >o ipc_immediate_free() is invoked at the end of a grace
> > period, freeing the message queue.
> >
> >o sys_msgrcv() finally gets a chance to run, and does an
> > rcu_read_lock() -- but too late!!!
> >
> Not too late:
> sys_msgrcv() checks msr_d.r_msr, notices that the value is -EIDRM and
> returns to user space with -EIDRM immediately. This codepath
> doesn't touch the message queue pointer, thus it doesn't matter that the
> message queue is already freed.
> The code only touches the message queue pointer if msr_d.r_msr
> is -EAGAIN - and the rcu_read_lock() guarantees there is no rcu grace
> period between the test for -EAGAIN and the ipc_lock_by_ptr.
> Thus this should be safe.

OK, seems like it does handle that scenario. Sorry for the noise!

And the other possible scenario, where the wakeup happens before the
assignment of NULL to msr_d.r_msr, be prevented by the two rounds of
rq lock (one in try_to_wake_up(), the other when the task actually
starts running).

Thanx, Paul

> But back to the oops:
> The oops happens in expunge_all, called from sys_msgctl.
> Thus it must be an msgctl(IPC_SET).
> IPC_SET is special: it calls expunge_all(-EAGAIN): that's necessary
> because IPC_SET can change the permissions.
> Unfortunately, faked doesn't use IPC_SET at all :-(
>
> Falk - could you strace your "fakeroot ls" test? Are there any IPC_SET
> calls?
> Which gcc version do you use? Is it possible that gcc auto-inlined
> something?
>
> --
> Manfred