2003-02-16 01:29:02

by Linus Torvalds

Subject: Re: Fw: 2.5.61 oops running SDET


On Sat, 15 Feb 2003, Andrew Morton wrote:
>
> The recent change to fs/proc/array.c:task_sig()?
>
> buffer += sprintf(buffer, "ShdPnd:\t");
> buffer = render_sigset_t(&p->signal->shared_pending.signal, buffer);

Yeah, but I think the bug has existed for much longer.

It looks like "proc_pid_status()" doesn't actually lock the task at all,
nor even bother to test whether the task has signal state. Which has
_always_ been a bug. I don't know why it would start happening now, but it
might just be unlucky timing.

I think the proper fix is to put a

task_lock()
..
task_unlock()

around the whole proc_pid_status() function, _and_ then verify that
"tsk->sighand" is non-NULL.

(Oh, careful, that's already what "get_task_mm()" does internally, so look
out for deadlocks - you'd need to open-code the get_task_mm() in there
too, so the end result is something like

task_lock(task)
if (task->mm) {
.. mm state
}
if (task->sighand) {
.. signal state
}
..
task_unlock(task);

instead).

Linus


2003-02-16 02:00:17

by Martin J. Bligh

Subject: Re: Fw: 2.5.61 oops running SDET

> Yeah, but I think the bug has existed for much longer.
>
> It looks like "proc_pid_status()" doesn't actually lock the task at all,
> nor even bother to test whether the task has signal state. Which has
> _always_ been a bug. I don't know why it would start happening now, but
> it might just be unlucky timing.
>
> I think the proper fix is to put a
>
> task_lock()
> ..
> task_unlock()
>
> around the whole proc_pid_status() function, _and_ then verify that
> "tsk->sighand" is non-NULL.
>
> (Oh, careful, that's already what "get_task_mm()" does internally, so
> look out for deadlocks - you'd need to open-code the get_task_mm() in
> there too, so the end result is something like
>
> task_lock(task)
> if (task->mm) {
> .. mm state
> }
> if (task->sighand) {
> .. signal state
> }
> ..
> task_unlock(task);
>
> instead).

Something like this? Compiles, but not tested yet ...

diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c
proc/fs/proc/array.c
--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
+++ proc/fs/proc/array.c Sat Feb 15 18:05:10 2003
@@ -243,8 +243,10 @@ extern char *task_mem(struct mm_struct *
int proc_pid_status(struct task_struct *task, char * buffer)
{
char * orig = buffer;
- struct mm_struct *mm = get_task_mm(task);
+ struct mm_struct *mm;

+ task_lock(task);
+ mm = __get_task_mm(task);
buffer = task_name(task, buffer);
buffer = task_state(task, buffer);

@@ -257,6 +259,7 @@ int proc_pid_status(struct task_struct *
#if defined(CONFIG_ARCH_S390)
buffer = task_show_regs(task, buffer);
#endif
+ task_unlock(task);
return buffer - orig;
}

diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/sched.h
proc/include/linux/sched.h
--- virgin/include/linux/sched.h Sat Feb 15 16:11:47 2003
+++ proc/include/linux/sched.h Sat Feb 15 18:04:42 2003
@@ -706,6 +706,18 @@ static inline struct mm_struct * get_tas

return mm;
}
+
+/* lockless version of get_task_mm */
+static inline struct mm_struct * __get_task_mm(struct task_struct * task)
+{
+ struct mm_struct * mm;
+
+ mm = task->mm;
+ if (mm)
+ atomic_inc(&mm->mm_users);
+
+ return mm;
+}


/* set thread flags in other task's structures

2003-02-16 02:21:02

by Linus Torvalds

Subject: Re: Fw: 2.5.61 oops running SDET


On Sat, 15 Feb 2003, Martin J. Bligh wrote:
>
> Something like this? Compiles, but not tested yet ...

Close, but you also need

- buffer = task_sig(task, buffer);
+ if (task->sighand)
+ buffer = task_sig(task, buffer);

to actually check whether signals exist or not. Otherwise you'll just get
the same oops anyway (well, keeping the task locked may change timings
enough that it doesn't happen, but the bug will continue to be there).

I would also say that since you explicitly take the task lock, there's no
real reason to use "get_task_mm()" at all any more, so instead of doing
that (and then doing the mmput()), just get rid of the mm variable
entirely, and do

if (task->mm)
buffer = task_mem(task->mm, buffer)

too. There's really no downside to just holding on to the task lock over
the whole operation instead of incrementing and decrementing the mm
counts and dropping the lock early.

(There are a few valid reasons for using the "get_task_mm()" function:

- you need to block and thus drop the task lock

- the original code just used "task->mm" directly, and using
"get_task_mm()" made the original conversion to safe mm handling
easier.

Neither reason is really valid any more in this function at least).

Linus

2003-02-16 02:40:00

by Zwane Mwaikambo

Subject: Re: Fw: 2.5.61 oops running SDET

On Sat, 15 Feb 2003, Martin J. Bligh wrote:

> - struct mm_struct *mm = get_task_mm(task);
> + struct mm_struct *mm;
>
> + task_lock(task);
> + mm = __get_task_mm(task);
> buffer = task_name(task, buffer);
> buffer = task_state(task, buffer);

task_state acquires the task_struct lock.

Zwane
--
function.linuxpower.ca

2003-02-16 03:51:16

by Martin J. Bligh

Subject: Re: Fw: 2.5.61 oops running SDET

> Close, but you also need
>
> - buffer = task_sig(task, buffer);
> + if (task->sighand)
> + buffer = task_sig(task, buffer);
>
> to actually check whether signals exist or not. Otherwise you'll just get
> the same oops anyway (well, keeping the task locked may change timings
> enough that it doesn't happen, but the bug will continue to be there).
>
> I would also say that since you explicitly take the task lock, there's no
> real reason to use "get_task_mm()" at all any more, so instead of doing
> that (and then doing the mmput()), just get rid of the mm variable
> entirely, and do
>
> if (task->mm)
> buffer = task_mem(task->mm, buffer)
>
> too. There's really no downside to just holding on to the task lock over
> the whole operation instead of incrementing and decrementing the mm
> counts and dropping the lock early.
>
> (There are a few valid reasons for using the "get_task_mm()" function:
>
> - you need to block and thus drop the task lock
>
> - the original code just used "task->mm" directly, and using
> "get_task_mm()" made the original conversion to safe mm handling
> easier.
>
> Neither reason is really valid any more in this function at least).

OK, I did the following, which is what I think you wanted, plus Zwane's
observation that task_state acquires the task_struct lock (we're the only
caller, so I just removed it), but I still get the same panic and this time
the box hung. No doubt I've just done something stupid in the patch ...
(task_state takes the tasklist_lock ... is that safe to do inside task_lock?)

diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet/fs/proc/array.c
--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
+++ sdet/fs/proc/array.c Sat Feb 15 19:28:34 2003
@@ -166,12 +166,10 @@ static inline char * task_state(struct t
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
read_unlock(&tasklist_lock);
- task_lock(p);
buffer += sprintf(buffer,
"FDSize:\t%d\n"
"Groups:\t",
p->files ? p->files->max_fds : 0);
- task_unlock(p);

for (g = 0; g < p->ngroups; g++)
buffer += sprintf(buffer, "%d ", p->groups[g]);
@@ -243,20 +241,20 @@ extern char *task_mem(struct mm_struct *
int proc_pid_status(struct task_struct *task, char * buffer)
{
char * orig = buffer;
- struct mm_struct *mm = get_task_mm(task);

+ task_lock(task);
buffer = task_name(task, buffer);
buffer = task_state(task, buffer);

- if (mm) {
- buffer = task_mem(mm, buffer);
- mmput(mm);
- }
- buffer = task_sig(task, buffer);
+ if (task->mm)
+ buffer = task_mem(task->mm, buffer);
+ if (task->sighand)
+ buffer = task_sig(task, buffer);
buffer = task_cap(task, buffer);
#if defined(CONFIG_ARCH_S390)
buffer = task_show_regs(task, buffer);
#endif
+ task_unlock(task);
return buffer - orig;
}




Unable to handle kernel NULL pointer dereference at virtual address 00000014
printing eip:
c0118b13
*pde = 2d4bf001
*pte = 00000000
Oops: 0000
CPU: 6
EIP: 0060:[<c0118b13>] Not tainted
EFLAGS: 00010207
EIP is at render_sigset_t+0x17/0x7c
eax: 00000010 ebx: ed3d909f ecx: ed3d9097 edx: eda60100
esi: 0000003c edi: 00000010 ebp: ed49bf04 esp: ed49bef8
ds: 007b es: 007b ss: 0068
Process ps (pid: 3339, threadinfo=ed49a000 task=ee32cd20)
Stack: ed3d907d 00000002 ed3d909f 00000006 c016d85f 00000010 ed3d909f ed3d9097
c0233f38 eda606bc ed3d9086 ed3d907e c0233f2f ee1d7b70 eda60100 eeb4d0c0
ed3d9000 ee1d7b70 00000000 ed3d9000 000000d0 eeb4d0c0 c01300f6 c17119e8
Call Trace:
[<c016d85f>] proc_pid_status+0x26f/0x334
[<c01300f6>] __get_free_pages+0x4e/0x54
[<c016b517>] proc_info_read+0x53/0x130
[<c0145295>] vfs_read+0xa5/0x128
[<c0145522>] sys_read+0x2a/0x3c
[<c0108b3f>] syscall_call+0x7/0xb

Code: 0f a3 37 19 d2 b8 01 00 00 00 31 c9 85 d2 0f 45 c8 8d 56 01
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000014
printing eip:
c0118b13
*pde = 2d512001
*pte = 00000000
Oops: 0000
CPU: 2
EIP: 0060:[<c0118b13>] Not tainted
EFLAGS: 00010207
EIP is at render_sigset_t+0x17/0x7c
eax: 00000010 ebx: ee34b0a0 ecx: ee34b098 edx: ed3ea7a0
esi: 0000003c edi: 00000010 ebp: ed54ff04 esp: ed54fef8
ds: 007b es: 007b ss: 0068
Process ps (pid: 14710, threadinfo=ed54e000 task=ee430d40)
Stack: ee34b07e 00000002 ee34b0a0 00000006 c016d85f 00000010 ee34b0a0 ee34b098
c0233f38 ed3ead5c ee34b087 ee34b07f c0233f2f ee4960d0 ed3ea7a0 eec66ca0
ee34b000 ee4960d0 00000000 ee34b000 000000d0 eec66ca0 c01300f6 c17383b8
Call Trace:
[<c016d85f>] proc_pid_status+0x26f/0x334
[<c01300f6>] __get_free_pages+0x4e/0x54
[<c016b517>] proc_info_read+0x53/0x130
[<c0145295>] vfs_read+0xa5/0x128
[<c0145522>] sys_read+0x2a/0x3c
[<c0108b3f>] syscall_call+0x7/0xb

Code: 0f a3 37 19 d2 b8 01 00 00 00 31 c9 85 d2 0f 45 c8 8d 56 01

2003-02-16 12:59:59

by Anton Blanchard

Subject: Re: Fw: 2.5.61 oops running SDET


> OK, I did the following, which is what I think you wanted, plus Zwane's
> observation that task_state acquires the task_struct lock (we're the only
> caller, so I just removed it), but I still get the same panic and this time
> the box hung. No doubt I've just done something stupid in the patch ...
> (task_state takes the tasklist_lock ... is that safe to do inside task_lock?)

It looks like you now try and take the tasklist_lock inside a task_lock()
region, but task_lock says no-can-do:

/* Protects ->fs, ->files, ->mm, and synchronises with wait4(). Nests
* inside tasklist_lock */
static inline void task_lock(struct task_struct *p)
{
...

Anton

2003-02-16 16:30:03

by Martin J. Bligh

Subject: Re: Fw: 2.5.61 oops running SDET

>> OK, I did the following, which is what I think you wanted, plus Zwane's
>> observation that task_state acquires the task_struct lock (we're the only
>> caller, so I just removed it), but I still get the same panic and this time
>> the box hung. No doubt I've just done something stupid in the patch ...
>> (task_state takes the tasklist_lock ... is that safe to do inside task_lock?)
>
> It looks like you now try and take the tasklist_lock inside a task_lock()
> region, but task_lock says no-can-do:
>
> /* Protects ->fs, ->files, ->mm, and synchronises with wait4(). Nests
> * inside tasklist_lock */
> static inline void task_lock(struct task_struct *p)
> {
> ...

Right, that's what I expected, and that may well explain the hang, but I
don't see how that explains the oops still happening. What exactly is
tasklist_lock protecting here anyway?

read_lock(&tasklist_lock);
buffer += sprintf(buffer,
"State:\t%s\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
"TracerPid:\t%d\n"
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p), p->tgid,
p->pid, p->pid ? p->real_parent->pid : 0,
p->pid && p->ptrace ? p->parent->pid : 0,
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
read_unlock(&tasklist_lock);

Is it these two accesses:

p->real_parent->pid ?
p->parent->pid ?

Don't see what I can do for this apart from to invert the ordering and take
tasklist_lock around the whole function, and nest task_lock inside that, or
I suppose I could take the task_lock for each of the parents? I seem to
recall Linus reminding people recently that it was only the lock
acquisition order that was important, not release ... does something like
the following look OK?


diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet2/fs/proc/array.c
--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
+++ sdet2/fs/proc/array.c Sun Feb 16 08:37:45 2003
@@ -147,11 +147,11 @@ static inline const char * get_task_stat
return *p;
}

+/* Call me with the tasklist_lock and task_lock for p held already */
static inline char * task_state(struct task_struct *p, char *buffer)
{
int g;

- read_lock(&tasklist_lock);
buffer += sprintf(buffer,
"State:\t%s\n"
"Tgid:\t%d\n"
@@ -165,13 +165,10 @@ static inline char * task_state(struct t
p->pid && p->ptrace ? p->parent->pid : 0,
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
- read_unlock(&tasklist_lock);
- task_lock(p);
buffer += sprintf(buffer,
"FDSize:\t%d\n"
"Groups:\t",
p->files ? p->files->max_fds : 0);
- task_unlock(p);

for (g = 0; g < p->ngroups; g++)
buffer += sprintf(buffer, "%d ", p->groups[g]);
@@ -243,20 +240,22 @@ extern char *task_mem(struct mm_struct *
int proc_pid_status(struct task_struct *task, char * buffer)
{
char * orig = buffer;
- struct mm_struct *mm = get_task_mm(task);

+ read_lock(&tasklist_lock);
+ task_lock(task);
buffer = task_name(task, buffer);
buffer = task_state(task, buffer);
+ read_unlock(&tasklist_lock);

- if (mm) {
- buffer = task_mem(mm, buffer);
- mmput(mm);
- }
- buffer = task_sig(task, buffer);
+ if (task->mm)
+ buffer = task_mem(task->mm, buffer);
+ if (task->sighand)
+ buffer = task_sig(task, buffer);
buffer = task_cap(task, buffer);
#if defined(CONFIG_ARCH_S390)
buffer = task_show_regs(task, buffer);
#endif
+ task_unlock(task);
return buffer - orig;
}





2003-02-16 18:01:34

by Linus Torvalds

Subject: Re: Fw: 2.5.61 oops running SDET


On Sat, 15 Feb 2003, Martin J. Bligh wrote:
>
> OK, I did the following, which is what I think you wanted, plus Zwane's
> observation that task_state acquires the task_struct lock (we're the only
> caller, so I just removed it), but I still get the same panic and this time
> the box hung.

Yeah, a closer look shows that the exit path doesn't actually take the
task lock at all around any of the signal stuff, so the lock protects
"task->mm", "task->files" and "task->fs", but it does NOT protect
"task->signal" or "task->sighand" (illogical, but true).

Oh, damn.

The code that checks for "p->sighand" takes the tasklist lock for this
reason (see "collect_sigign_sigcatch()").

So the choice seems to be either:

- make the exit path hold the task lock over the whole exit path, not
just over mm exit.

- take the "tasklist_lock" over more of "task_sig()" (not just the
collect_sigign_sigcatch() thing, but the "&p->signal->shared_pending"
rendering too).

The latter is a two-liner. The former is the "right thing" for multiple
reasons.

The reason I'd _like_ to see the task lock held over _all_ of the fields
in the exit() path is:

- right now we actually take it and drop it multiple times in exit. See
__exit_files(), __exit_fs(), __exit_mm(). Which all take it just to
serialize setting of "mm/files/fs" to NULL.

- task_lock is a nice local lock, no global scalability impact.

So it really would be much nicer to do something like this in do_exit():

struct mm_struct *mm;
struct files_struct *files;
struct fs_struct *fs;
struct signal_struct *signal;
struct sighand_struct *sighand;

task_lock(task);
mm = task->mm;
files = task->files;
fs = task->fs;
signal = task->signal;
sighand = task->sighand;

task->mm = NULL;
task->files = NULL;
task->fs = NULL;
task->signal = NULL;
task->sighand = NULL;
task_unlock(task);

.. actually do the "__exit_mm(task, mm)" etc here ..

which would make things a lot more readable, and result in us taking the
lock only _once_ instead of three times, and would properly protect
"signal" and "sighand" so that the /proc code wouldn't need to take the
heavily used "tasklist_lock" just to read the signal state for a single
task.

But fixing up exit to do the above would require several (trivial) calling
convention changes, like changing

static inline void __exit_mm(struct task_struct * tsk)
{
struct mm_struct *mm = tsk->mm;
...

into

static inline void __exit_mm(struct task_struct * tsk,
struct mm_struct *mm)
{
...

instead and updating the callers.

Is anybody willing to do that (hopefully fairly trivial) fixup and test
it, or should we go with the stupid "take the 'tasklist_lock'" approach?

Linus

2003-02-16 18:14:59

by Linus Torvalds

Subject: Re: Fw: 2.5.61 oops running SDET


On Sun, 16 Feb 2003, Martin J. Bligh wrote:
>
> Right, that's what I expected, and that may well explain the hang, but I
> don't see how that explains the oops still happening. What exactly is
> tasklist_lock protecting here anyway?
>
> read_lock(&tasklist_lock);
> buffer += sprintf(buffer,
> "State:\t%s\n"
> "Tgid:\t%d\n"
> "Pid:\t%d\n"
> "PPid:\t%d\n"
> "TracerPid:\t%d\n"
> "Uid:\t%d\t%d\t%d\t%d\n"
> "Gid:\t%d\t%d\t%d\t%d\n",
> get_task_state(p), p->tgid,
> p->pid, p->pid ? p->real_parent->pid : 0,
> p->pid && p->ptrace ? p->parent->pid : 0,
> p->uid, p->euid, p->suid, p->fsuid,
> p->gid, p->egid, p->sgid, p->fsgid);
> read_unlock(&tasklist_lock);
>
> Is it these two accesses:
>
> p->real_parent->pid ?
> p->parent->pid ?

Yup.

> Don't see what I can do for this apart from to invert the ordering and take
> tasklist_lock around the whole function, and nest task_lock inside that, or
> I suppose I could take the task_lock for each of the parents? I seem to
> recall Linus reminding people recently that it was only the lock
> acquisition order that was important, not release ... does something like
> the following look OK?

This patch looks like it should certainly fix the problem, but that is
still some god-awful ugly overkill in locking.

I'd rather make the rule be that you have to take the task lock before
modifying things like the parent pointers (and all the other fundamental
pointers), since that's already the rule for most of it anyway.

And then the tasklist lock would go away _entirely_ from /proc (except for
task lookup in ->readdir/->lookup, of course, where it is fundamentally
needed and proper - and will probably some day be replaced by RCU, I
suspect).

Linus

2003-02-16 18:16:48

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET

> So the choice seems to be either:
>
> - make the exit path hold the task lock over the whole exit path, not
> just over mm exit.
>
> - take the "tasklist_lock" over more of "task_sig()" (not just the
> collect_sigign_sigcatch() thing, but the "&p->signal->shared_pending"
> rendering too.
>
> The latter is a two-liner. The former is the "right thing" for multiple
> reasons.
>
> The reason I'd _like_ to see the task lock held over _all_ of the fields
> in the exit() path is:
>
> - right now we actually take it and drop it multiple times in exit. See
> __exit_files(), __exit_fs(), __exit_mm(). Which all take it just to
> serialize setting of "mm/files/fs" to NULL.
>
> - task_lock is a nice local lock, no global scalability impact.

Don't we have to take tasklist_lock anyway for when task_state reads
p->real_parent->pid and p->parent->pid? Or should we have to grab
the task_lock for those too? If so, it sounds like it could get into
rather horrible ordering dependencies to me.

If we move exit under task_lock ... then tasklist_lock doesn't help us
any more there either (presumably we're trying to stop them exiting
whilst we look at them) ... I've coded up the stupid approach for now,
and will check that works first. Should have results very soon.

M.

2003-02-16 18:26:52

by Martin J. Bligh

Subject: Re: Fw: 2.5.61 oops running SDET

>> The reason I'd _like_ to see the task lock held over _all_ of the fields
>> in the exit() path is:
>>
>> - right now we actually take it and drop it multiple times in exit. See
>> __exit_files(), __exit_fs(), __exit_mm(). Which all take it just to
>> serialize setting of "mm/files/fs" to NULL.
>>
>> - task_lock is a nice local lock, no global scalability impact.
>
> Don't we have to take tasklist_lock anyway for when task_state reads
> p->real_parent->pid and p->parent->pid? Or should we have to grab
> the task_lock for those too? If so, it sounds like it could get into
> rather horrible ordering dependencies to me.
>
> If we move exit under task_lock ... then tasklist_lock doesn't help us
> any more there either (presumably we're trying to stop them exiting
> whilst we look at them) ... I've coded up the stupid approach for now,
> and will check that works first. Should have results very soon.

Hmmmm ... I guess we could keep real_parent->pid and parent->pid inside
the child's task_struct ... I can't see those change unless we're changing
real_parent / parent task pointers anyway. Not sure how much that idea
will make people vomit, but it would seem to make it easier to get rid
of the global locking, and it's probably nicer than trying to order the
acquisition of other people's task locks ...

M.

2003-02-16 18:56:52

by Martin J. Bligh

Subject: Re: Fw: 2.5.61 oops running SDET

>> Don't see what I can do for this apart from to invert the ordering and take
>> tasklist_lock around the whole function, and nest task_lock inside that, or
>> I suppose I could take the task_lock for each of the parents? I seem to
>> recall Linus reminding people recently that it was only the lock
>> acquisition order that was important, not release ... does something like
>> the following look OK?
>
> This patch looks like it should certainly fix the problem, but that is
> still some god-awful ugly overkill in locking.
>
> I'd rather make the rule be that you have to take the task lock before
> modifying things like the parent pointers (and all the other fundamental
> pointers), since that's already the rule for most of it anyway.
>
> And then the tasklist lock would go away _entirely_ from /proc (except for
> task lookup in ->readdir/->lookup, of course, where it is fundamentally
> needed and proper - and will probably some day be replaced by RCU, I
> suspect).

Well, I did the stupid safe thing, and it hangs the box once we get up to
a load of 32 with SDET. Below is what I did, the only other issue I can
see in here is that task_mem takes mm->mmap_sem which is now nested inside
the task_lock inside tasklist_lock ... but I can't see anywhere that's a
problem from a quick search.

diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet2/fs/proc/array.c
--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
+++ sdet2/fs/proc/array.c Sun Feb 16 09:59:24 2003
@@ -147,11 +147,11 @@ static inline const char * get_task_stat
return *p;
}

+/* Call me with the tasklist_lock and task_lock for p held already */
static inline char * task_state(struct task_struct *p, char *buffer)
{
int g;

- read_lock(&tasklist_lock);
buffer += sprintf(buffer,
"State:\t%s\n"
"Tgid:\t%d\n"
@@ -165,13 +165,10 @@ static inline char * task_state(struct t
p->pid && p->ptrace ? p->parent->pid : 0,
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
- read_unlock(&tasklist_lock);
- task_lock(p);
buffer += sprintf(buffer,
"FDSize:\t%d\n"
"Groups:\t",
p->files ? p->files->max_fds : 0);
- task_unlock(p);

for (g = 0; g < p->ngroups; g++)
buffer += sprintf(buffer, "%d ", p->groups[g]);
@@ -243,20 +240,22 @@ extern char *task_mem(struct mm_struct *
int proc_pid_status(struct task_struct *task, char * buffer)
{
char * orig = buffer;
- struct mm_struct *mm = get_task_mm(task);

+ read_lock(&tasklist_lock);
+ task_lock(task);
buffer = task_name(task, buffer);
buffer = task_state(task, buffer);

- if (mm) {
- buffer = task_mem(mm, buffer);
- mmput(mm);
- }
- buffer = task_sig(task, buffer);
+ if (task->mm)
+ buffer = task_mem(task->mm, buffer);
+ if (task->sighand)
+ buffer = task_sig(task, buffer);
buffer = task_cap(task, buffer);
#if defined(CONFIG_ARCH_S390)
buffer = task_show_regs(task, buffer);
#endif
+ task_unlock(task);
+ read_unlock(&tasklist_lock);
return buffer - orig;
}





2003-02-16 19:09:10

by Manfred Spraul

Subject: Re: Fw: 2.5.61 oops running SDET

Martin J. Bligh wrote:

>Well, I did the stupid safe thing, and it hangs the box once we get up to
>a load of 32 with SDET. Below is what I did, the only other issue I can
>see in here is that task_mem takes mm->mmap_sem which is now nested inside
>the task_lock inside tasklist_lock ... but I can't see anywhere that's a
>problem from a quick search
>
>
task_lock() is actually spin_lock(&tsk->alloc_lock); mmap_sem is a
semaphore. Nesting a semaphore inside a spinlock means deadlock.

--
Manfred

2003-02-16 19:10:34

by Linus Torvalds

Subject: Re: Fw: 2.5.61 oops running SDET


On Sun, 16 Feb 2003, Martin J. Bligh wrote:
>
> Well, I did the stupid safe thing, and it hangs the box once we get up to
> a load of 32 with SDET. Below is what I did, the only other issue I can
> see in here is that task_mem takes mm->mmap_sem which is now nested inside
> the task_lock inside tasklist_lock ... but I can't see anywhere that's a
> problem from a quick search

Ho - you can _never_ nest a semaphore inside a spinlock - if the semaphore
sleeps, the spinlock will cause a lockup regardless of any reverse nesting
issues. So I guess the old "get_task_mm()" code is required anyway.

Linus

2003-02-16 19:09:25

by Martin J. Bligh

Subject: more signal locking bugs?

task_lock nests *inside* tasklist_lock but ... :

__do_SAK:

task_lock
...
send_sig(SIGKILL, ...)
    send_sig_info(SIGKILL, ...)
        if (T(sig, SIG_KERNEL_BROADCAST_MASK))
            read_lock(&tasklist_lock);
            ...
            read_unlock(&tasklist_lock);
...
task_unlock


#define SIG_KERNEL_BROADCAST_MASK (\
M(SIGHUP) | M(SIGINT) | M(SIGQUIT) | M(SIGILL) | \
M(SIGTRAP) | M(SIGABRT) | M(SIGBUS) | M(SIGFPE) | \
M(SIGKILL) | M(SIGUSR1) | M(SIGSEGV) | M(SIGUSR2) | \
M(SIGPIPE) | M(SIGALRM) | M(SIGTERM) | M(SIGXCPU) | \
M(SIGXFSZ) | M(SIGVTALRM) | M(SIGPROF) | M(SIGPOLL) | \
M(SIGSYS) | M_SIGSTKFLT | M(SIGPWR) | M(SIGCONT) | \
M(SIGSTOP) | M(SIGTSTP) | M(SIGTTIN) | M(SIGTTOU) | \
M_SIGEMT )

2003-02-16 19:17:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Martin J. Bligh wrote:
>
> task_lock nests *inside* tasklist_lock but ... :

Yeah, this is a problem with any signal, not just SAK.

In fact, in general it is always _wrong_ to nest a non-interrupt-safe
lock inside an interrupt-safe spinlock, because then you can never take
the non-interrupt-safe one on its own (because an interrupt may come in
and take the interrupt-safe lock, at which point that CPU has now
violated ordering).

And if you always nest the interrupt-unsafe one inside the interrupt-
safe one, you might as well not have the interrupt-unsafe one in the first
place, since the outer lock _always_ protects the code in question anyway.

Which implies that either
- the task lock should be _outside_ the tasklist_lock
or
- the task lock should be made interrupt-safe.

If we make the tasklock one interrupt-safe, that should fix the signal
issue, and we can use the tasklock to protect "task->signal" and
"task->sighand".

In short, everything really seems to be pointing that way: the current
task lock simply _is_ broken, and has apparently always been broken (but
the ABBA deadlock is just extremely rare in practice, since you have to
get an interrupt at just the right point on one CPU, while you have the AB
case on another).

Linus

2003-02-16 19:27:40

by Manfred Spraul

Subject: Re: more signal locking bugs?

Linus Torvalds wrote:

>In short, everything really seems to be pointing that way: the current
>task lock simply _is_ broken, and has apparently always been broken (but
>the ABBA deadlock is just extremely rare in practice, since you have to
>get an interrupt at just the right point on one CPU, while you have the AB
>case on another).
>
ABBA is not a deadlock, because linux read_locks permit recursive calls.

read_lock(tasklist_lock);
task_lock(tsk);
read_lock(tasklist_lock);

Does not deadlock, nor any other ordering.

The tasklist_lock is never taken for write from bh or irq context.

--
Manfred

2003-02-16 19:36:41

by Linus Torvalds

Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Manfred Spraul wrote:
>
> ABBA is not a deadlock, because linux read_locks permit recursive calls.
>
> read_lock(tasklist_lock);
> task_lock(tsk);
> read_lock(tasklist_lock);
>
> Does not deadlock, nor any other ordering.
>
> The tasklist_lock is never taken for write from bh or irq context.

Doesn't matter.

CPU1 - do_exit()                        CPU2 - non-nested task_lock()

                                        task_lock(tsk)
write_lock_irq(&tasklist_lock);         IRQ HAPPENS
task_lock(tsk);                         read_lock(&tasklist_lock)

BOOM, you're dead.

See? ABBA _does_ happen with the task lock, it's just that the magic
required to do so is fairly unlikely thanks to the added requirement for
the irq to happen at just the right moment (ie there are no static
code-paths that can cause it).

Linus

2003-02-16 19:55:34

by Linus Torvalds

Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Linus Torvalds wrote:
>
> See? ABBA _does_ happen with the task lock, it's just that the magic
> required to do so is fairly unlikely thanks to the added requirement for
> the irq to happen at just the right moment (ie there are no static
> code-paths that can cause it).

Note: to clarify, this isn't in any way a new situation. As far as I can
tell, this deadlock exists in 2.4.x too. And it's almost certainly pretty
much impossible to trigger in practice, and as such we shouldn't need to
be deeply worried about it.

But it should make us think about the _design_ of locking in this region.
Clearly we got it wrong, and clearly we never noticed for several years.

The simple fix is to make the task-lock be IRQ-safe. That fixes it, and
that's probably the right minimal solution for 2.4.x (unless the "fix"
for 2.4.x is to just ignore it since it's possible to trigger mostly in a
theoretical sense).

But assuming you accept that making the task-lock be irq-safe is the right
solution, then making it solve the signal handling and /proc scalability
issues is likely to be the right solution: since it's irq-safe, there's no
real reason not to use it to protect task->{sighand | signal | parent}
too.

Linus

2003-02-16 19:57:53

by Manfred Spraul

Subject: Re: more signal locking bugs?

On Sun, 16 Feb 2003, Linus Torvalds wrote:

>
> On Sun, 16 Feb 2003, Manfred Spraul wrote:
> >
> > ABBA is not a deadlock, because linux read_locks permit recursive calls.
> >
> > read_lock(tasklist_lock);
> > task_lock(tsk);
> > read_lock(tasklist_lock);
> >
> > Does not deadlock, nor any other ordering.
> >
> > The tasklist_lock is never taken for write from bh or irq context.
>
> Doesn't matter.
>
> CPU1 - do_exit()
> write_lock_irq(&tasklist_lock);
> task_lock(tsk);
>
> BOOM, you're dead.
>
> See? ABBA _does_ happen with the task lock.
>
But these lines are not in 2.4 or 2.5.61.
The current rule to nesting tasklist_lock and task_lock is
- read_lock(&tasklist_lock) and task_lock can be mixed in any order.
- write_lock_irq(&tasklist_lock) and task_lock are incompatible.

What about this minimal patch? The performance critical operation is
signal delivery - we should fix the synchronization between signal
delivery and exec first.

--
Manfred
<<
--- 2.5/fs/proc/array.c 2003-02-15 10:29:22.000000000 +0100
+++ build-2.5/fs/proc/array.c 2003-02-16 20:58:37.000000000 +0100
@@ -208,23 +208,27 @@
{
sigset_t ign, catch;

- buffer += sprintf(buffer, "SigPnd:\t");
- buffer = render_sigset_t(&p->pending.signal, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "ShdPnd:\t");
- buffer = render_sigset_t(&p->signal->shared_pending.signal, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "SigBlk:\t");
- buffer = render_sigset_t(&p->blocked, buffer);
- *buffer++ = '\n';
-
- collect_sigign_sigcatch(p, &ign, &catch);
- buffer += sprintf(buffer, "SigIgn:\t");
- buffer = render_sigset_t(&ign, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "SigCgt:\t"); /* Linux 2.0 uses "SigCgt" */
- buffer = render_sigset_t(&catch, buffer);
- *buffer++ = '\n';
+ read_lock(&tasklist_lock);
+ if (p->signal && p->sighand) {
+ buffer += sprintf(buffer, "SigPnd:\t");
+ buffer = render_sigset_t(&p->pending.signal, buffer);
+ *buffer++ = '\n';
+ buffer += sprintf(buffer, "ShdPnd:\t");
+ buffer = render_sigset_t(&p->signal->shared_pending.signal, buffer);
+ *buffer++ = '\n';
+ buffer += sprintf(buffer, "SigBlk:\t");
+ buffer = render_sigset_t(&p->blocked, buffer);
+ *buffer++ = '\n';
+
+ collect_sigign_sigcatch(p, &ign, &catch);
+ buffer += sprintf(buffer, "SigIgn:\t");
+ buffer = render_sigset_t(&ign, buffer);
+ *buffer++ = '\n';
+ buffer += sprintf(buffer, "SigCgt:\t"); /* Linux 2.0 uses "SigCgt" */
+ buffer = render_sigset_t(&catch, buffer);
+ *buffer++ = '\n';
+ }
+ read_unlock(&tasklist_lock);

return buffer;
}
>>

2003-02-16 20:04:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Manfred Spraul wrote:
> But these lines are not in 2.4 or 2.5.61.
> The current rule for nesting tasklist_lock and task_lock is:
> - read_lock(&tasklist_lock) and task_lock can be mixed in any order.
> - write_lock_irq(&tasklist_lock) and task_lock are incompatible.

Oh, you're right, and you're right exactly _because_ "task->signal" isn't
protected by the task lock right now. Aurgh. I had already mentally done
that protection, which is why I thought we already had the bug.

So never mind. 2.4.x is obviously also ok.

> What about this minimal patch? The performance critical operation is
> signal delivery - we should fix the synchronization between signal
> delivery and exec first.

The patch looks ok, although I'd also remove the locking and testing from
collect_sigign_sigcatch() once it is done at a higher level.

And yeah, what about signal delivery? Put back the same lock there?

Linus

2003-02-16 20:13:50

by Manfred Spraul

[permalink] [raw]
Subject: Re: more signal locking bugs?

Linus Torvalds wrote:

>>What about this minimal patch? The performance critical operation is
>>signal delivery - we should fix the synchronization between signal
>>delivery and exec first.
>>
>>
>
>The patch looks ok, although I'd also remove the locking and testing from
>collect_sigign_sigcatch() once it is done at a higher level.
>
>And yeah, what about signal delivery? Put back the same lock there?
>
>
I don't know.
Taking read_lock(&tasklist_lock) for send_specific_sig_info might hurt
scalability. Ingo?

If we do not want a global lock, then we have two options:
- make task_lock an interrupt spinlock.
- add a second spinlock to the task structure, for the signal stuff.

Design question - which is worse: memory bloat, or a few additional
local_irq_{en,dis}able() calls? I don't care either way - these are not
performance-critical codepaths.
Additionally many task_lock()/task_unlock users could be replaced with
spin_unlock_wait(&task->alloc_lock), which would not need the
local_irq_disable().

--
Manfred

2003-02-16 21:05:28

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET

>> Well, I did the stupid safe thing, and it hangs the box once we get up to
>> a load of 32 with SDET. Below is what I did, the only other issue I can
>> see in here is that task_mem takes mm->mmap_sem which is now nested inside
>> the task_lock inside tasklist_lock ... but I can't see anywhere that's a
>> problem from a quick search
>
> Ho - you can _never_ nest a semaphore inside a spinlock - if the semaphore
> sleeps, the spinlock will cause a lockup regardless of any reverse nesting
> issues. So I guess the old "get_task_mm()" code is required anyway.

True ... staring at unfamiliar code made me forget a few things ;-)
OK, the below patch works, no oopses, no hangs. If this is acceptable
(even if it's not the final solution), could you apply it? Seems to
be better than the current situation, at any rate.

-----------------------------------------------------

We can encounter a race condition where task->sighand can be NULL in
proc_pid_status, resulting in an offset NULL pointer dereference in
render_sigset_t. This can easily be demonstrated by running SDET with a
load of 32 to 64 on a large machine (16x NUMA-Q in this case).

This patch makes us take task_lock around the sighand dereferences,
and fixes the oops in testing on the same machine. Thanks to Linus,
Andrew, Manfred and Zwane for lots of guidance ;-)

diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet3/fs/proc/array.c
--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
+++ sdet3/fs/proc/array.c Sun Feb 16 11:44:24 2003
@@ -252,8 +252,11 @@ int proc_pid_status(struct task_struct *
buffer = task_mem(mm, buffer);
mmput(mm);
}
- buffer = task_sig(task, buffer);
+ task_lock(task);
+ if (task->sighand)
+ buffer = task_sig(task, buffer);
buffer = task_cap(task, buffer);
+ task_unlock(task);
#if defined(CONFIG_ARCH_S390)
buffer = task_show_regs(task, buffer);
#endif

2003-02-16 21:12:07

by Manfred Spraul

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET

Martin J. Bligh wrote:

>diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet3/fs/proc/array.c
>--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
>+++ sdet3/fs/proc/array.c Sun Feb 16 11:44:24 2003
>@@ -252,8 +252,11 @@ int proc_pid_status(struct task_struct *
> buffer = task_mem(mm, buffer);
> mmput(mm);
> }
>- buffer = task_sig(task, buffer);
>+ task_lock(task);
>+ if (task->sighand)
>+ buffer = task_sig(task, buffer);
> buffer = task_cap(task, buffer);
>+ task_unlock(task);
> #if defined(CONFIG_ARCH_S390)
> buffer = task_show_regs(task, buffer);
> #endif
>
>
I think it's needed for 2.4, too.

--
Manfred

2003-02-16 22:28:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET


On Sun, 16 Feb 2003, Manfred Spraul wrote:
> Martin J. Bligh wrote:
>
> >diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet3/fs/proc/array.c
> >--- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
> >+++ sdet3/fs/proc/array.c Sun Feb 16 11:44:24 2003
> >@@ -252,8 +252,11 @@ int proc_pid_status(struct task_struct *
> > buffer = task_mem(mm, buffer);
> > mmput(mm);
> > }
> >- buffer = task_sig(task, buffer);
> >+ task_lock(task);
> >+ if (task->sighand)
> >+ buffer = task_sig(task, buffer);
> > buffer = task_cap(task, buffer);
> >+ task_unlock(task);
> > #if defined(CONFIG_ARCH_S390)
> > buffer = task_show_regs(task, buffer);
> > #endif
> >
> >
> I think it's needed for 2.4, too.

It's not wrong, but it shouldn't help. Simply because "task_lock()"
isn't even relevant to "task->sighand" as it stands now. It will be
cleared without holding the task lock, as far as I can see.

So I suspect it fixes things for Martin only because it changes timings
enough not to hit the race.

Linus

2003-02-16 22:59:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET

>> > diff -urpN -X /home/fletch/.diff.exclude virgin/fs/proc/array.c sdet3/fs/proc/array.c
>> > --- virgin/fs/proc/array.c Sat Feb 15 16:11:45 2003
>> > +++ sdet3/fs/proc/array.c Sun Feb 16 11:44:24 2003
>> > @@ -252,8 +252,11 @@ int proc_pid_status(struct task_struct *
>> > buffer = task_mem(mm, buffer);
>> > mmput(mm);
>> > }
>> > - buffer = task_sig(task, buffer);
>> > + task_lock(task);
>> > + if (task->sighand)
>> > + buffer = task_sig(task, buffer);
>> > buffer = task_cap(task, buffer);
>> > + task_unlock(task);
>> > #if defined(CONFIG_ARCH_S390)
>> > buffer = task_show_regs(task, buffer);
>> > #endif
>> >
>> >
>> I think it's needed for 2.4, too.
>
> It's not wrong, but it shouldn't help. Simply because "task_lock()"
> isn't even relevant to "task->sighand" as it stands now. It will be
> cleared without holding the task lock, as far as I can see.
>
> So I suspect it fixes things for Martin only because it changes timings
> enough not to hit the race.

Ah, fair enough ... it's probably the if, rather than the task_lock.

So what does protect sighand? tasklist_lock? It doesn't seem to
as people do things like:

spin_unlock_irq(&current->sighand->siglock);

all the time ... so is it just protected by good faith and the direction
of the wind?

M.

2003-02-16 23:26:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fw: 2.5.61 oops running SDET


On Sun, 16 Feb 2003, Martin J. Bligh wrote:
>
> So what does protect sighand? tasklist_lock?

Yup. At least right now. And we do tend to hold tasklist_lock in most
cases simply by virtue of having to get it anyway in order to search for
the process.

> It doesn't seem to
> as people do things like:
>
> spin_unlock_irq(&current->sighand->siglock);
>
> all the time ... so is it just protected by good faith and the direction
> of the wind?

Agreed. That, and the fact that most of the time the stuff _is_ there:
obviously any time the process looks at its own signals it will always be
there.

So we have two cases:
- "normal" signal sending which holds tasklist_lock to find the target.
- signals to the process itself, which is usually safe, since we know
we're there (the exception would be if/when taking a fault in the exit
path, but even that might be ok since signals are torn down not by exit
itself but by "release_task")

So the signal path may actually be ok.

Linus

2003-02-17 00:16:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Manfred Spraul wrote:
>
> What about this minimal patch? The performance critical operation is
> signal delivery - we should fix the synchronization between signal
> delivery and exec first.

Ok, I committed this alternative change, which isn't quite as minimal, but
looks a lot cleaner to me.

Also, looking at execve() and other paths, we do seem to have sufficient
protection from the tasklist_lock that signal delivery should be fine. So
despite a long and confused thread, I think in the end the only real bug
was the one Martin found which should be thus fixed..

Linus

---
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1054 -> 1.1055
# include/linux/signal.h 1.8 -> 1.9
# fs/proc/array.c 1.42 -> 1.43
# kernel/sched.c 1.158 -> 1.159
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/02/16 [email protected] 1.1055
# Clean up and fix locking around signal rendering
# --------------------------------------------
#
diff -Nru a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c Sun Feb 16 16:22:45 2003
+++ b/fs/proc/array.c Sun Feb 16 16:22:45 2003
@@ -180,51 +180,74 @@
return buffer;
}

+static char * render_sigset_t(const char *header, sigset_t *set, char *buffer)
+{
+ int i, len;
+
+ len = strlen(header);
+ memcpy(buffer, header, len);
+ buffer += len;
+
+ i = _NSIG;
+ do {
+ int x = 0;
+
+ i -= 4;
+ if (sigismember(set, i+1)) x |= 1;
+ if (sigismember(set, i+2)) x |= 2;
+ if (sigismember(set, i+3)) x |= 4;
+ if (sigismember(set, i+4)) x |= 8;
+ *buffer++ = (x < 10 ? '0' : 'a' - 10) + x;
+ } while (i >= 4);
+
+ *buffer++ = '\n';
+ *buffer = 0;
+ return buffer;
+}
+
static void collect_sigign_sigcatch(struct task_struct *p, sigset_t *ign,
sigset_t *catch)
{
struct k_sigaction *k;
int i;

- sigemptyset(ign);
- sigemptyset(catch);
+ k = p->sighand->action;
+ for (i = 1; i <= _NSIG; ++i, ++k) {
+ if (k->sa.sa_handler == SIG_IGN)
+ sigaddset(ign, i);
+ else if (k->sa.sa_handler != SIG_DFL)
+ sigaddset(catch, i);
+ }
+}
+
+static inline char * task_sig(struct task_struct *p, char *buffer)
+{
+ sigset_t pending, shpending, blocked, ignored, caught;
+
+ sigemptyset(&pending);
+ sigemptyset(&shpending);
+ sigemptyset(&blocked);
+ sigemptyset(&ignored);
+ sigemptyset(&caught);

+ /* Gather all the data with the appropriate locks held */
read_lock(&tasklist_lock);
if (p->sighand) {
spin_lock_irq(&p->sighand->siglock);
- k = p->sighand->action;
- for (i = 1; i <= _NSIG; ++i, ++k) {
- if (k->sa.sa_handler == SIG_IGN)
- sigaddset(ign, i);
- else if (k->sa.sa_handler != SIG_DFL)
- sigaddset(catch, i);
- }
+ pending = p->pending.signal;
+ shpending = p->signal->shared_pending.signal;
+ blocked = p->blocked;
+ collect_sigign_sigcatch(p, &ignored, &caught);
spin_unlock_irq(&p->sighand->siglock);
}
read_unlock(&tasklist_lock);
-}
-
-static inline char * task_sig(struct task_struct *p, char *buffer)
-{
- sigset_t ign, catch;

- buffer += sprintf(buffer, "SigPnd:\t");
- buffer = render_sigset_t(&p->pending.signal, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "ShdPnd:\t");
- buffer = render_sigset_t(&p->signal->shared_pending.signal, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "SigBlk:\t");
- buffer = render_sigset_t(&p->blocked, buffer);
- *buffer++ = '\n';
-
- collect_sigign_sigcatch(p, &ign, &catch);
- buffer += sprintf(buffer, "SigIgn:\t");
- buffer = render_sigset_t(&ign, buffer);
- *buffer++ = '\n';
- buffer += sprintf(buffer, "SigCgt:\t"); /* Linux 2.0 uses "SigCgt" */
- buffer = render_sigset_t(&catch, buffer);
- *buffer++ = '\n';
+ /* render them all */
+ buffer = render_sigset_t("SigPnd:\t", &pending, buffer);
+ buffer = render_sigset_t("ShdPnd:\t", &shpending, buffer);
+ buffer = render_sigset_t("SigBlk:\t", &blocked, buffer);
+ buffer = render_sigset_t("SigIgn:\t", &ignored, buffer);
+ buffer = render_sigset_t("SigCgt:\t", &caught, buffer);

return buffer;
}
diff -Nru a/include/linux/signal.h b/include/linux/signal.h
--- a/include/linux/signal.h Sun Feb 16 16:22:45 2003
+++ b/include/linux/signal.h Sun Feb 16 16:22:45 2003
@@ -151,8 +151,6 @@
}
}

-extern char * render_sigset_t(sigset_t *set, char *buffer);
-
/* Some extensions for manipulating the low 32 signals in particular. */

static inline void sigaddsetmask(sigset_t *set, unsigned long mask)
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Sun Feb 16 16:22:45 2003
+++ b/kernel/sched.c Sun Feb 16 16:22:45 2003
@@ -2095,21 +2095,6 @@
}
}

-char * render_sigset_t(sigset_t *set, char *buffer)
-{
- int i = _NSIG, x;
- do {
- i -= 4, x = 0;
- if (sigismember(set, i+1)) x |= 1;
- if (sigismember(set, i+2)) x |= 2;
- if (sigismember(set, i+3)) x |= 4;
- if (sigismember(set, i+4)) x |= 8;
- *buffer++ = (x < 10 ? '0' : 'a' - 10) + x;
- } while (i >= 4);
- *buffer = 0;
- return buffer;
-}
-
void show_state(void)
{
task_t *g, *p;

2003-02-17 01:55:23

by Martin J. Bligh

[permalink] [raw]
Subject: Re: more signal locking bugs?

> Ok, I committed this alternative change, which isn't quite as minimal, but
> looks a lot cleaner to me.
>
> Also, looking at execve() and other paths, we do seem to have sufficient
> protection from the tasklist_lock that signal delivery should be fine. So
> despite a long and confused thread, I think in the end the only real bug
> was the one Martin found which should be thus fixed..

Ran your patch ... but I get plenty of these now:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c016d5a8
*pde = 2e10f001
*pte = 00000000
Oops: 0000
CPU: 2
EIP: 0060:[<c016d5a8>] Not tainted
EFLAGS: 00010202
EIP is at collect_sigign_sigcatch+0x1c/0x44
eax: ed6d72c0 ebx: ed403f50 ecx: 00000004 edx: 00000001
esi: ed403f48 edi: ed6d72c0 ebp: 00000000 esp: ed403eec
ds: 007b es: 007b ss: 0068
Process ps (pid: 19988, threadinfo=ed402000 task=ed9352a0)
Stack: c011d115 c02af820 c016dab8 ed6d72c0 ed403f48 ed403f50 ed6d72c0 edb99a50
ed6d72c0 eebf33c0 ee225000 c034e400 ed403f50 ed403f48 00000000 52000000
00008800 00000125 eebf33c0 eebf33e0 00000000 00000000 00000000 000000d0
Call Trace:
[<c011d115>] do_exit+0x30d/0x31c
[<c016dab8>] proc_pid_stat+0x110/0x324
[<c0130076>] __get_free_pages+0x4e/0x54
[<c016b497>] proc_info_read+0x53/0x130
[<c0145215>] vfs_read+0xa5/0x128
[<c01454a2>] sys_read+0x2a/0x3c
[<c0108b3f>] syscall_call+0x7/0xb

Code: 8b 01 83 f8 01 75 08 8d 42 ff 0f ab 06 eb 0a 85 c0 74 06 8d

2003-02-17 02:29:26

by Martin J. Bligh

[permalink] [raw]
Subject: Re: more signal locking bugs?

>> Ok, I committed this alternative change, which isn't quite as minimal, but
>> looks a lot cleaner to me.
>>
>> Also, looking at execve() and other paths, we do seem to have sufficient
>> protection from the tasklist_lock that signal delivery should be fine. So
>> despite a long and confused thread, I think in the end the only real bug
>> was the one Martin found which should be thus fixed..
>
> Ran your patch ... but I get plenty of these now:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
> printing eip:
> c016d5a8
> *pde = 2e10f001
> *pte = 00000000
> Oops: 0000
> CPU: 2
> EIP: 0060:[<c016d5a8>] Not tainted
> EFLAGS: 00010202
> EIP is at collect_sigign_sigcatch+0x1c/0x44
> eax: ed6d72c0 ebx: ed403f50 ecx: 00000004 edx: 00000001
> esi: ed403f48 edi: ed6d72c0 ebp: 00000000 esp: ed403eec
> ds: 007b es: 007b ss: 0068
> Process ps (pid: 19988, threadinfo=ed402000 task=ed9352a0)
> Stack: c011d115 c02af820 c016dab8 ed6d72c0 ed403f48 ed403f50 ed6d72c0 edb99a50
> ed6d72c0 eebf33c0 ee225000 c034e400 ed403f50 ed403f48 00000000 52000000
> 00008800 00000125 eebf33c0 eebf33e0 00000000 00000000 00000000 000000d0
> Call Trace:
> [<c011d115>] do_exit+0x30d/0x31c
> [<c016dab8>] proc_pid_stat+0x110/0x324
> [<c0130076>] __get_free_pages+0x4e/0x54
> [<c016b497>] proc_info_read+0x53/0x130
> [<c0145215>] vfs_read+0xa5/0x128
> [<c01454a2>] sys_read+0x2a/0x3c
> [<c0108b3f>] syscall_call+0x7/0xb
>
> Code: 8b 01 83 f8 01 75 08 8d 42 ff 0f ab 06 eb 0a 85 c0 74 06 8d

Ah, I see what happened, I think .... the locking used to be inside
collect_sigign_sigcatch, and you moved it out into task_sig ... but
there were two callers of collect_sigign_sigcatch, the other one being
proc_pid_stat ... which now needs to have appropriate locking added to
it as well to match the change ... I'll see if I can get something
together that works for that.

M.

2003-02-17 03:46:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: more signal locking bugs?


On Sun, 16 Feb 2003, Martin J. Bligh wrote:
>
> Ah, I see what happened, I think .... the locking used to be inside
> collect_sigign_sigcatch, and you moved it out into task_sig ... but
> there were two callers of collect_sigign_sigcatch, the other one being
> proc_pid_stat

Doh.

This should fix it.

Linus

---
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1055 -> 1.1056
# fs/proc/array.c 1.43 -> 1.44
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/02/16 [email protected] 1.1056
# Do proper signal locking for the old-style /proc/stat too.
# --------------------------------------------
#
diff -Nru a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c Sun Feb 16 19:52:30 2003
+++ b/fs/proc/array.c Sun Feb 16 19:52:30 2003
@@ -316,7 +316,13 @@

wchan = get_wchan(task);

- collect_sigign_sigcatch(task, &sigign, &sigcatch);
+ read_lock(&tasklist_lock);
+ if (task->sighand) {
+ spin_lock_irq(&task->sighand->siglock);
+ collect_sigign_sigcatch(task, &sigign, &sigcatch);
+ spin_lock_irq(&task->sighand->siglock);
+ }
+ read_unlock(&tasklist_lock);

/* scale priority and nice values from timeslices to -20..20 */
/* to make it look like a "normal" Unix priority/nice value */

2003-02-17 03:44:30

by Martin J. Bligh

[permalink] [raw]
Subject: [PATCH] fix secondary oops in sighand locking

>> Ran your patch ... but I get plenty of these now:
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 00000004
>> ...
>
> Ah, I see what happened, I think .... the locking used to be inside
> collect_sigign_sigcatch, and you moved it out into task_sig ... but
> there were two callers of collect_sigign_sigcatch, the other one being
> proc_pid_stat ... which now needs to have appropriate locking added to
> it as well to match the change ... I'll see if I can get something
> together that works for that.

OK, with this small extra patch, we now survive a beating with SDET
without oops or freeze. The patch adds the locking that was removed
from collect_sigign_sigcatch to the second caller (proc_pid_stat),
as well as task_sig.

diff -urpN -X /home/fletch/.diff.exclude linus-sdet/fs/proc/array.c linus-sdet-fix/fs/proc/array.c
--- linus-sdet/fs/proc/array.c Sun Feb 16 18:48:04 2003
+++ linus-sdet-fix/fs/proc/array.c Sun Feb 16 18:51:34 2003
@@ -316,7 +316,12 @@ int proc_pid_stat(struct task_struct *ta

wchan = get_wchan(task);

- collect_sigign_sigcatch(task, &sigign, &sigcatch);
+ sigemptyset(&sigign);
+ sigemptyset(&sigcatch);
+ read_lock(&tasklist_lock);
+ if (task->sighand)
+ collect_sigign_sigcatch(task, &sigign, &sigcatch);
+ read_unlock(&tasklist_lock);

/* scale priority and nice values from timeslices to -20..20 */
/* to make it look like a "normal" Unix priority/nice value */

2003-02-17 04:57:28

by Martin J. Bligh

[permalink] [raw]
Subject: Re: more signal locking bugs?

>> Ah, I see what happened, I think .... the locking used to be inside
>> collect_sigign_sigcatch, and you moved it out into task_sig ... but
>> there were two callers of collect_sigign_sigcatch, the other one being
>> proc_pid_stat
>
> Doh.
>
> This should fix it.

Don't you need this bit as well?

+ sigemptyset(&sigign);
+ sigemptyset(&sigcatch);

to replace this bit from your previous patch.

static void collect_sigign_sigcatch(struct task_struct *p, sigset_t *ign,
sigset_t *catch)
{
struct k_sigaction *k;
int i;

- sigemptyset(ign);
- sigemptyset(catch);

That was in the patch I sent you ... but I missed task->sighand->siglock ;-)

M.

2003-02-17 06:07:33

by Martin J. Bligh

[permalink] [raw]
Subject: Re: more signal locking bugs?

>> Ah, I see what happened, I think .... the locking used to be inside
>> collect_sigign_sigcatch, and you moved it out into task_sig ... but
>> there were two callers of collect_sigign_sigcatch, the other one being
>> proc_pid_stat
>
> Doh.
>
> This should fix it.

Oooh, not only does SDET work now in 61, it doesn't freeze the whole box
when I hit ^C any more (like it's been doing since the dawn of time).
Spiffy ;-)


M.

2003-02-17 06:27:25

by Manfred Spraul

[permalink] [raw]
Subject: Re: more signal locking bugs?

Linus Torvalds wrote:

>On Sun, 16 Feb 2003, Manfred Spraul wrote:
>
>
>>What about this minimal patch? The performance critical operation is
>>signal delivery - we should fix the synchronization between signal
>>delivery and exec first.
>>
>>
>
>Ok, I committed this alternative change, which isn't quite as minimal, but
>looks a lot cleaner to me.
>
>Also, looking at execve() and other paths, we do seem to have sufficient
>protection from the tasklist_lock that signal delivery should be fine. So
>despite a long and confused thread, I think in the end the only real bug
>was the one Martin found which should be thus fixed..
>
>
I'm not convinced that exec is correct. Take an app with two threads
and cloned sighand and sig structures: thread one does exec(), thread
two does exit().

Now we can arrive at no_thread_group in de_thread() and
tsk->sig{,hand}->count == 1.

>no_thread_group:
>
> write_lock_irq(&tasklist_lock);
> spin_lock(&oldsighand->siglock);
> spin_lock(&newsighand->siglock);
>
> if (current == oldsig->curr_target)
> oldsig->curr_target = next_thread(current);
>
signal sender: in send_sig_info().
reads tsk->signal. blocks on tsk->sighand->siglock.

> if (newsig)
> current->signal = newsig;
> current->sighand = newsighand;
> init_sigpending(&current->pending);
> recalc_sigpending();
>
> spin_unlock(&newsighand->siglock);
> spin_unlock(&oldsighand->siglock);
> write_unlock_irq(&tasklist_lock);
>
> if (newsig && atomic_dec_and_test(&oldsig->count))
> kmem_cache_free(signal_cachep, oldsig);
>
> if (atomic_dec_and_test(&oldsighand->count))
> kmem_cache_free(sighand_cachep, oldsighand);
>
>
And now the signal sender continues. BOOM.
sighand structure, sig structure already kfreed, etc.

--
Manfred

2003-02-17 16:59:22

by Magnus Naeslund(f)

[permalink] [raw]
Subject: Re: more signal locking bugs?

Linus Torvalds <[email protected]> wrote:
>
> Doh.
>
> This should fix it.

[snip]

> + spin_lock_irq(&task->sighand->siglock);
> + collect_sigign_sigcatch(task, &sigign, &sigcatch);
> + spin_lock_irq(&task->sighand->siglock);

[snip]

You take the lock twice here?

Magnus

2003-02-17 19:00:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: more signal locking bugs?


On Mon, 17 Feb 2003, Manfred Spraul wrote:
>
> I'm not convinced that exec is correct.

No, I think you're right. And I think just fixing send_sig_info() to take
the tasklist lock is the right thing to do.

That still leaves force_sig_info() without the tasklist lock, but since
that is only used for page faults and other synchronous errors, that's fine (if
we get a synchronous error in the kernel, we have bigger problems than
signal locking).

Linus