Andrew,
This patch fixes a race in which timer-generated signals are delivered to an
exiting process after task->sighand has been cleared.
update_one_process() is declared static, as it is only used in kernel/timer.c.
Signed-off-by: Jeremy Kerr <[email protected]>
Jeremy
diff -urN linux-2.6.7-rc2-bk2.orig/include/linux/sched.h
linux-2.6.7-rc2-bk2/include/linux/sched.h
--- linux-2.6.7-rc2-bk2.orig/include/linux/sched.h 2004-06-02
11:29:13.000000000 +1000
+++ linux-2.6.7-rc2-bk2/include/linux/sched.h 2004-06-02 11:46:57.000000000
+1000
@@ -168,8 +168,6 @@
extern void cpu_init (void);
extern void trap_init(void);
extern void update_process_times(int user);
-extern void update_one_process(struct task_struct *p, unsigned long user,
- unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
diff -urN linux-2.6.7-rc2-bk2.orig/kernel/signal.c
linux-2.6.7-rc2-bk2/kernel/signal.c
--- linux-2.6.7-rc2-bk2.orig/kernel/signal.c 2004-06-02 11:29:13.000000000
+1000
+++ linux-2.6.7-rc2-bk2/kernel/signal.c 2004-06-02 11:47:28.000000000 +1000
@@ -323,7 +323,10 @@
{
struct sighand_struct * sighand = tsk->sighand;
- /* Ok, we're done with the signal handlers */
+ /* Ok, we're done with the signal handlers.
+ * Set sighand to NULL to tell kernel/timer.c not
+ * to deliver further signals to this task
+ */
tsk->sighand = NULL;
if (atomic_dec_and_test(&sighand->count))
kmem_cache_free(sighand_cachep, sighand);
diff -urN linux-2.6.7-rc2-bk2.orig/kernel/timer.c
linux-2.6.7-rc2-bk2/kernel/timer.c
--- linux-2.6.7-rc2-bk2.orig/kernel/timer.c 2004-06-02 11:29:13.000000000
+1000
+++ linux-2.6.7-rc2-bk2/kernel/timer.c 2004-06-02 11:47:08.000000000 +1000
@@ -829,7 +829,7 @@
}
}
-void update_one_process(struct task_struct *p, unsigned long user,
+static void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu)
{
do_process_times(p, user, system);
@@ -846,7 +846,9 @@
struct task_struct *p = current;
int cpu = smp_processor_id(), system = user_tick ^ 1;
- update_one_process(p, user_tick, system, cpu);
+ /* Don't send signals to current after release_task() */
+ if (likely(p->sighand))
+ update_one_process(p, user_tick, system, cpu);
run_local_timers();
scheduler_tick(user_tick, system);
}
Jeremy Kerr <[email protected]> wrote:
>
> This patch fixes a race where timer-generated signals are delivered to an
> exiting process, after task->sighand is cleared.
Nasty. I'm surprised that we haven't hit this more frequently. I guess
timer-generated signals aren't very common.
However I'm not sure that your fix is complete:
void update_process_times(int user_tick)
{
struct task_struct *p = current;
int cpu = smp_processor_id(), system = user_tick ^ 1;
/* Don't send signals to current after release_task() */
if (likely(p->sighand))
update_one_process(p, user_tick, system, cpu);
versus:
void __exit_sighand(struct task_struct *tsk)
{
struct sighand_struct * sighand = tsk->sighand;
/* Ok, we're done with the signal handlers.
* Set sighand to NULL to tell kernel/timer.c not
* to deliver further signals to this task
*/
tsk->sighand = NULL;
if (atomic_dec_and_test(&sighand->count))
kmem_cache_free(sighand_cachep, sighand);
If these two functions are running on different CPUs then the race is still
there - update_process_times() can test and then use p->sighand while
__exit_sighand() is throwing away p->sighand.
Question is, can these functions run on separate CPUs? Certainly a
different CPU can run release_task(), via wait4().
And there's a little window at the end of exit_notify() where the exiting
task (which is still "current" on its CPU) can take a timer interrupt while
in state TASK_ZOMBIE. The CPU which is running wait4() will run
release_task() for the exiting task and the above race can occur.
(And if the exiting task is being ptraced things get more complex..)
Did I miss something?
It seems silly to be trying to deliver timer signals to processes which are
so late in exit, and we could perhaps set ->it_prof_value and
->it_virt_value to zero earlier in exit. That's sane, but doesn't fix the
race.
Right now, I see no alternative to adding locking which pins task->sighand
while the timer handler is running. Taking tasklist_lock on each timer
tick will hurt - maybe a new per-process lock is needed?
On Wed, 2004-06-02 at 15:57, Andrew Morton wrote:
> void update_process_times(int user_tick)
> {
> struct task_struct *p = current;
> int cpu = smp_processor_id(), system = user_tick ^ 1;
> versus:
>
> void __exit_sighand(struct task_struct *tsk)
> {
> struct sighand_struct * sighand = tsk->sighand;
No. tsk == current for __exit_sighand. You know, getting current is SO
expensive, and so we should PASS IT to functions explicitly!
One of my pet gripes: this code is badly obfuscated by this. Is it just
me?
> And there's a little window at the end of exit_notify() where the exiting
> task (which is still "current" on its CPU) can take a timer interrupt while
> in state TASK_ZOMBIE. The CPU which is running wait4() will run
> release_task() for the exiting task and the above race can occur.
Hmm, while we're at it, the task seems to release itself while running
here: exit_notify() -> release_task() -> put_task_struct() ->
__put_task_struct() -> BOOM?
Surely not, what am I missing?
> Right now, I see no alternative to adding locking which pins task->sighand
> while the timer handler is running. Taking tasklist_lock on each timer
> tick will hurt - maybe a new per-process lock is needed?
Hmm, a per-cpu cache of exited tasks: one task for each CPU. We hold a
reference to the task struct until the next exit on the same CPU
happens? We could also reuse that cache for fork()...
Rusty.
--
Anyone who quotes me in their signature is an idiot -- Rusty Russell
Rusty Russell <[email protected]> wrote:
>
> On Wed, 2004-06-02 at 15:57, Andrew Morton wrote:
> > void update_process_times(int user_tick)
> > {
> > struct task_struct *p = current;
> > int cpu = smp_processor_id(), system = user_tick ^ 1;
> > versus:
> >
> > void __exit_sighand(struct task_struct *tsk)
> > {
> > struct sighand_struct * sighand = tsk->sighand;
>
> No. tsk == current for __exit_sighand.
Nope.
wait4->wait_task_zombie->release_task->__exit_sighand
> You know, getting current is SO
> expensive, and so we should PASS IT to functions explicitly!
>
> One of my pet gripes: this code is badly obfuscated by this. Is it just
> me?
It does make the code a bit hard to follow, but yes, evaluating `current'
takes ~14 bytes of text on x86. Although I think recent versions of the
compiler are able to cse it.
Still, the right way to do this is to lose all the dopey `p's and `tsk's
and adopt a convention of using `this_task' or some such identifier.
> > And there's a little window at the end of exit_notify() where the exiting
> > task (which is still "current" on its CPU) can take a timer interrupt while
> > in state TASK_ZOMBIE. The CPU which is running wait4() will run
> > release_task() for the exiting task and the above race can occur.
>
> Hmm, while we're at it, the task seems to release itself while running
> here: exit_notify() -> release_task() -> put_task_struct() ->
> __put_task_struct() -> BOOM?
>
> Surely not, what am I missing?
There's still one put to go - that happens at the end of
finish_task_switch(), via the next-to-run process.
> > Right now, I see no alternative to adding locking which pins task->sighand
> > while the timer handler is running. Taking tasklist_lock on each timer
> > tick will hurt - maybe a new per-process lock is needed?
>
> Hmm, a per-cpu cache of exited tasks: one task for each CPU. We hold a
> reference to the task struct until the next exit on the same CPU
> happens? We could also reuse that cache for fork()...
The problem is task_struct.sighand, not the task_struct itself. ->sighand
can be thrown away while update_process_times() is still poking at it.
hmm, maybe it _is_ sufficient to null out current->it_prof_value and
current->it_virt_incr in the do_exit() path. Because in *this* case we know
that it is always the exiting task which is doing the nulling, so the
timer interrupt cannot run on a different CPU.
--- 25/kernel/exit.c~a 2004-06-02 00:06:33.864102384 -0700
+++ 25-akpm/kernel/exit.c 2004-06-02 00:07:12.050297200 -0700
@@ -749,6 +749,13 @@ static void exit_notify(struct task_stru
* complete, and with interrupts blocked that will never happen.
*/
_raw_write_unlock(&tasklist_lock);
+
+ /*
+ * comment goes here
+ */
+ tsk->it_virt_incr = 0;
+ tsk->it_prof_value = 0;
+
local_irq_enable();
/* If the process is dead, release it - nobody will wait for it */
_
yes?
Andrew Morton <[email protected]> wrote:
>
> yes?
no. It needs tasklist_lock as well, to keep the other CPU (which is doing
wait4) at bay.
Fix a race identified by Jeremy Kerr <[email protected]>: if
update_process_times() decides to deliver a signal due to process timer
expiry, it can race with __exit_sighand()'s freeing of task->sighand.
Fix that by clearing the per-process timer state in exit_notify(), while under
local_irq_disable() and under tasklist_lock. tasklist_lock provides exclusion
wrt release_task()'s freeing of task->sighand and local_irq_disable() provides
exclusion wrt update_process_times()'s inspection of the per-process timer
state.
Signed-off-by: Andrew Morton <[email protected]>
---
25-akpm/kernel/exit.c | 7 +++++++
1 files changed, 7 insertions(+)
diff -puN kernel/exit.c~really-fix-signal-race-during-process-exit kernel/exit.c
--- 25/kernel/exit.c~really-fix-signal-race-during-process-exit 2004-06-02 00:09:01.491659584 -0700
+++ 25-akpm/kernel/exit.c 2004-06-02 00:15:31.230410288 -0700
@@ -737,6 +737,13 @@ static void exit_notify(struct task_stru
tsk->flags |= PF_DEAD;
/*
+ * Clear these here so that update_process_times() won't try to deliver
+ * itimer signals to this task while it is in late exit.
+ */
+ tsk->it_virt_incr = 0;
+ tsk->it_prof_value = 0;
+
+ /*
* In the preemption case it must be impossible for the task
* to get runnable again, so use "_raw_" unlock to keep
* preempt_count elevated until we schedule().
_
> Andrew Morton <[email protected]> wrote:
> > yes?
>
> no. It needs tasklist_lock as well, to keep the other CPU (which is doing
> wait4) at bay.
almost:
> + tsk->it_virt_incr = 0;
> + tsk->it_prof_value = 0;
If we're using this approach, we also need to deal with the send_sig() calls
in do_process_times():
if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_cur) {
/* Send SIGXCPU every second.. */
if (!(psecs % HZ))
send_sig(SIGXCPU, p, 1);
/* and SIGKILL when we go over max.. */
if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_max)
send_sig(SIGKILL, p, 1);
}
by setting rlim_cur to RLIM_INFINITY.
Fix a race identified by Jeremy Kerr <[email protected]>: if
update_process_times() decides to deliver a signal due to process timer
expiry, it can race with __exit_sighand()'s freeing of task->sighand.
Fix that by clearing the per-process timer state in exit_notify(), while under
local_irq_disable() and under tasklist_lock. tasklist_lock provides exclusion
wrt release_task()'s freeing of task->sighand and local_irq_disable() provides
exclusion wrt update_process_times()'s inspection of the per-process timer
state.
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Jeremy Kerr <[email protected]>
diff -urN --exclude '.*.sw[op]' linux-2.6.7-rc2-bk2.orig/kernel/exit.c
linux-2.6.7-rc2-bk2/kernel/exit.c
--- linux-2.6.7-rc2-bk2.orig/kernel/exit.c 2004-06-02 11:29:13.000000000 +1000
+++ linux-2.6.7-rc2-bk2/kernel/exit.c 2004-06-02 18:02:05.000000000 +1000
@@ -736,6 +736,14 @@
tsk->state = state;
tsk->flags |= PF_DEAD;
+ /*
+ * Clear these here so that update_process_times() won't try to deliver
+ * itimer, profile or rlimit signals to this task while it is in late exit.
+ */
+ tsk->it_virt_incr = 0;
+ tsk->it_prof_value = 0;
+ tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
+
/*
* In the preemption case it must be impossible for the task
* to get runnable again, so use "_raw_" unlock to keep
Jeremy
Is there a reproducer case around so we can test fixes for this problem?
It seems to me that signals sent to an already dying task might as well
just be discarded anyway. All they ever do now (except for trip bugs) is
change what pending signals you see in the /proc/pid/status entry for a
zombie. What's wrong with this:
Index: linux-2.6/kernel/signal.c
===================================================================
RCS file: /home/roland/redhat/bkcvs/linux-2.5/kernel/signal.c,v
retrieving revision 1.120
diff -u -b -p -r1.120 signal.c
--- linux-2.6/kernel/signal.c 10 May 2004 20:28:20 -0000 1.120
+++ linux-2.6/kernel/signal.c 4 Jun 2004 01:16:31 -0000
@@ -161,6 +161,9 @@ static int sig_ignored(struct task_struc
{
void * handler;
+ if (t->flags & PF_DEAD)
+ return 1;
+
/*
* Tracers always want to know about signals..
*/
Thanks,
Roland
Roland McGrath <[email protected]> wrote:
>
> Is there a reproducer case around so we can test fixes for this problem?
Jeremy would have the details.
> It seems to me that signals sent to an already dying task might as well
> just be discarded anyway. All they ever do now (except for trip bugs) is
> change what pending signals you see in the /proc/pid/status entry for a
> zombie. What's wrong with this:
>
> Index: linux-2.6/kernel/signal.c
> ===================================================================
> RCS file: /home/roland/redhat/bkcvs/linux-2.5/kernel/signal.c,v
> retrieving revision 1.120
> diff -u -b -p -r1.120 signal.c
> --- linux-2.6/kernel/signal.c 10 May 2004 20:28:20 -0000 1.120
> +++ linux-2.6/kernel/signal.c 4 Jun 2004 01:16:31 -0000
> @@ -161,6 +161,9 @@ static int sig_ignored(struct task_struc
> {
> void * handler;
>
> + if (t->flags & PF_DEAD)
> + return 1;
> +
I'm not sure about the locking here. What happens if the task starts
exiting immediately after the above test?
Plus the above adds code to the delivery fastpath, rather than exit().
If we're going to do the above for other reasons (what are they?) then it
would be neater to add a new PF_NO_SIGNALS rather than overloading PF_DEAD.
Thanks to Rusty for the test case. It did reproduce the crash on my test
box using vanilla 2.6.6, so we can use it to verify differing fixes as needed.
> > --- linux-2.6/kernel/signal.c 10 May 2004 20:28:20 -0000 1.120
> > +++ linux-2.6/kernel/signal.c 4 Jun 2004 01:16:31 -0000
> > @@ -161,6 +161,9 @@ static int sig_ignored(struct task_struc
> > {
> > void * handler;
> >
> > + if (t->flags & PF_DEAD)
> > + return 1;
> > +
>
> I'm not sure about the locking here. What happens if the task starts
> exiting immediately after the above test?
This is effectively governed by the tasklist_lock. PF_DEAD is set with
tasklist_lock write-locked and irqs disabled. All the signal code that
calls sig_ignored runs with tasklist_lock read-locked. So whatever other
process doing this test would be blocking the exiting task at the start of
exit_notify.
Even if that were not so, this particular race concern may be moot: in all
the race situations, the same locking means that any other thread would be
excluded from fetching the task_struct pointer from the list in the first
place during the PF_DEAD window.
The concern I had is basically that this might not be true of all cases.
The only problem case that has come up is the current task's own interrupt
handlers calling signal code while interrupting release_task. I know for
the case of posix-timers it's not an issue because their cleanup is handled
with special synchronization in __exit_signal. What I'm not sure about is
all other sources of asynchronous signals that use task_struct pointers
rather than PID lookups, and so might send one while release_task is in progress.
e.g. async IO signals triggered via driver interrupts, etc.
> Plus the above adds code to the delivery fastpath, rather than exit().
That is a good point. If there are in reality not other cases to worry
about, then this is reason enough to omit what's then just a sanity check.
If other cases do appear in the wild, then perhaps we'll be lucky and they
will be the obvious null pointer dereference we got this time. But in the
worst case, it could be the cached ->sighand or ->signal pointer to memory
that's already been reused in another thread for something else and is then
clobbered by the signaller, bringing endless joy.
I am fairly confident I understand the code for timer signals, child-death
signals, parent-death signals, and how it relates to the exit issues. But
for any other miscellaneous asynchronous signals sent by whatever else, I
do not know for sure whether such sources are synchronously detached from
tasks during exit so as to avoid this kind of race.
> If we're going to do the above for other reasons (what are they?) then it
> would be neater to add a new PF_NO_SIGNALS rather than overloading PF_DEAD.
I wouldn't call this "overloading". There isn't any separate "all signals
ignored by short-circuit" mode that we want to be able to enable (I don't,
anyway). The point of the test is to short-circuit checking ->sighand or
->signal when they might be null, i.e. after exit_notify unlocks. It could
as well check (p->state & (TASK_ZOMBIE|TASK_DEAD)), which is state checked
elsewhere in the signal code.
My patch above is actually useless. All calls to sig_ignored take place
with ->sighand->siglock held, so it's after the window where the timer can
produce the crash. If the check is needed, then it's needed between:
read_lock(&tasklist_lock);
and:
spin_lock_irqsave(&p->sighand->siglock, flags);
in places like send_sigqueue, send_group_sigqueue, send_sig_info,
group_send_sig_info. Given my unsureness about other kinds of asynchronous
signals, I would be most comfortable with at least a:
BUG_ON(p->flags & PF_DEAD);
after read_lock(&tasklist_lock); in those functions.
Back to the local fix for the case we have hit, the signals produced by
update_process_times. I think the patch below that went in has a bug.
Perhaps I've missed something. It seems to me that clearing it_virt_incr
doesn't really cut it, since there will be one more hit before that value
is reloaded. Shouldn't it clear it_virt_value, as it clears it_prof_value?
Index: linux-2.6/kernel/exit.c
===================================================================
RCS file: /home/roland/redhat/bkcvs/linux-2.5/kernel/exit.c,v
retrieving revision 1.135
retrieving revision 1.136
diff -u -b -p -r1.135 -r1.136
--- linux-2.6/kernel/exit.c 29 May 2004 18:15:01 -0000 1.135
+++ linux-2.6/kernel/exit.c 2 Jun 2004 14:18:12 -0000 1.136
@@ -737,6 +737,14 @@ static void exit_notify(struct task_stru
tsk->flags |= PF_DEAD;
/*
+ * Clear these here so that update_process_times() won't try to deliver
+ * itimer, profile or rlimit signals to this task while it is in late exit.
+ */
+ tsk->it_virt_incr = 0;
+ tsk->it_prof_value = 0;
+ tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
+
+ /*
* In the preemption case it must be impossible for the task
* to get runnable again, so use "_raw_" unlock to keep
* preempt_count elevated until we schedule().
And btw, as I think someone else mentioned, update_one_process really
should be made static so it can be inlined. It has just the one caller,
update_process_times in the same file (kernel/timer.c).
Thanks,
Roland
Roland McGrath <[email protected]> wrote:
>
> Back to the local fix for the case we have hit, the signals produced by
> update_process_times. I think the patch below that went in has a bug.
> Perhaps I've missed something. It seems to me that clearing it_virt_incr
> doesn't really cut it, since there will be one more hit before that value
> is reloaded. Shouldn't it clear it_virt_value, as it clears it_prof_value?
>
yup.
As Roland McGrath <[email protected]> points out, we need to zero
task->it_virt_value to prevent timer-based signal delivery, not
->it_virt_incr.
Signed-off-by: Andrew Morton <[email protected]>
---
25-akpm/kernel/exit.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
diff -puN kernel/exit.c~exit-timer-interrupt-race-fix-fix kernel/exit.c
--- 25/kernel/exit.c~exit-timer-interrupt-race-fix-fix 2004-06-09 19:17:21.138846896 -0700
+++ 25-akpm/kernel/exit.c 2004-06-09 19:17:35.523660072 -0700
@@ -740,7 +740,7 @@ static void exit_notify(struct task_stru
* Clear these here so that update_process_times() won't try to deliver
* itimer, profile or rlimit signals to this task while it is in late exit.
*/
- tsk->it_virt_incr = 0;
+ tsk->it_virt_value = 0;
tsk->it_prof_value = 0;
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
_
> And btw, as I think someone else mentioned, update_one_process really
> should be made static so it can be inlined. It has just the one caller,
> update_process_times in the same file (kernel/timer.c).
yup.
Signed-off-by: Andrew Morton <[email protected]>
---
25-akpm/include/linux/sched.h | 2 --
25-akpm/kernel/timer.c | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff -puN include/linux/sched.h~staticalise-update_one_process include/linux/sched.h
--- 25/include/linux/sched.h~staticalise-update_one_process 2004-06-09 19:16:02.826752144 -0700
+++ 25-akpm/include/linux/sched.h 2004-06-09 19:16:23.919545552 -0700
@@ -176,8 +176,6 @@ long io_schedule_timeout(long timeout);
extern void cpu_init (void);
extern void trap_init(void);
extern void update_process_times(int user);
-extern void update_one_process(struct task_struct *p, unsigned long user,
- unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
diff -puN kernel/timer.c~staticalise-update_one_process kernel/timer.c
--- 25/kernel/timer.c~staticalise-update_one_process 2004-06-09 19:16:02.855747736 -0700
+++ 25-akpm/kernel/timer.c 2004-06-09 19:16:30.734509520 -0700
@@ -830,7 +830,7 @@ static inline void do_it_prof(struct tas
}
}
-void update_one_process(struct task_struct *p, unsigned long user,
+static void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu)
{
do_process_times(p, user, system);
_
On Thu, 2004-06-10 at 11:48, Roland McGrath wrote:
> The concern I had is basically that this might not be true of all cases.
> The only problem case that has come up is the current task's own interrupt
> handlers calling signal code while interrupting release_task. I know for
> the case of posix-timers it's not an issue because their cleanup is handled
> with special synchronization in __exit_signal. What I'm not sure about is
> all other sources of asynchronous signals that use task_struct pointers
> rather than PID lookups and so might do one while release_task is in progress.
> e.g. async IO signals triggered via driver interrupts, etc.
Yes. In 2.4 we explicitly checked in the signal code.
Why don't we do the sane thing and just do release_task() from
__put_task_struct(), rather than the current two-stage thing?
Rusty.
--
Anyone who quotes me in their signature is an idiot -- Rusty Russell