2008-01-30 22:40:54

by Rafael J. Wysocki

Subject: [Regression] 2.6.24-git8 (and earlier): Multiple processes stuck in D states after logout from KDE

Hi,

Recently I've been observing problems with unmounting the /home fs on reboot
and/or shutdown on two test boxes.

After some more investigation I've found that this is due to some KDE processes
stuck in D states after their owner has logged out.

This happens 100% of the time if there's a suspend/resume cycle before the user
logs out (i.e. the user logs into KDE, works for some time, suspends the box to
RAM and resumes one or more times and then logs out). Still, I also observe the
symptoms on a box that's never suspended.

I'm not sure how to debug this, so please advise.

Thanks,
Rafael
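
One way to confirm which tasks are stuck is to scan /proc for processes whose state field is 'D' (uninterruptible sleep). The following is only a minimal sketch, assuming a Linux /proc layout; the helper names stat_state() and list_d_state_tasks() are made up for illustration and are not part of any tool mentioned in this thread. It looks at thread-group leaders only, not at individual threads.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the state character from one /proc/<pid>/stat line.
 * The line looks like "pid (comm) S ..."; comm may itself contain
 * spaces or parentheses, so search for the LAST ')'.
 * Returns the state character, or 0 if the line is malformed.
 */
static char stat_state(const char *line)
{
    const char *close;

    assert(line != NULL);

    close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return 0;
    return close[2];
}

/*
 * Print the stat line of every process currently in the D state.
 * Returns the number of such processes, or -1 if /proc is unavailable.
 */
static int list_d_state_tasks(FILE *out)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (proc == NULL)
        return -1;

    while ((de = readdir(proc)) != NULL) {
        char path[288], line[512];
        FILE *f;

        /* Only numeric directory names are process IDs. */
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;

        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (f == NULL)
            continue;   /* the process exited in the meantime */

        if (fgets(line, sizeof(line), f) != NULL && stat_state(line) == 'D') {
            fputs(line, out);
            count++;
        }
        fclose(f);
    }

    closedir(proc);
    return count;
}
```

Calling list_d_state_tasks(stdout) right after logging out would show whether any of the user's processes are still in the D state. A single scan is only a snapshot, so a task that is briefly blocked on I/O can also appear; repeating the scan a few seconds apart helps tell the two cases apart.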


2008-01-30 23:44:42

by Rafael J. Wysocki

Subject: Re: [Regression] 2.6.24-git8 (and earlier): Multiple processes stuck in D states after logout from KDE

Update.

On Wednesday, 30 of January 2008, Rafael J. Wysocki wrote:
> Hi,
>
> Recently I've been observing problems with unmounting the /home fs on reboot
> and/or shutdown on two test boxes.
>
> After some more investigation I've found that this is due to some KDE processes
> stuck in D states after their owner has logged out.
>
> This happens 100% of the time if there's a suspend/resume cycle before the user
> logs out (i.e. the user logs into KDE, works for some time, suspends the box to
> RAM and resumes one or more times and then logs out). Still, I also observe the
> symptoms on a box that's never suspended.
>
> I'm not sure how to debug this, so please advise.

After reverting:

commit 37bb6cb4097e29ffee970065b74499cbf10603a3
Author: Peter Zijlstra <[email protected]>
Date: Fri Jan 25 21:08:32 2008 +0100

hrtimer: unlock hrtimer_wakeup

I no longer get processes in the D state, but there still is a problem with
artswrapper (this is an openSUSE 10.3 system, x86-64). Namely,
after a suspend/resume cycle and logging out/logging in the user,
artswrapper gets stuck somewhere, apparently in the running (R) state.
For this reason it blocks any subsequent attempts to suspend.

Here's the relevant trace (from show_state()):

[ 522.474919] artswrapper R running task 0 4805 1
[ 522.474922] ffff810074cd1f70 0000000000000082 0000000000000296 ffff810074cd1ed8
[ 522.474926] ffffffff80311769 ffff810074cd1f20 ffffffff80701240 ffffffff80701240
[ 522.474930] ffffffff80701240 ffffffff80701240 ffffffff80701240 ffffffff80701240
[ 522.474933] Call Trace:
[ 522.474940] [<ffffffff80311769>] ? __up_read+0x8f/0x97
[ 522.474963] [<ffffffff8020c5cf>] retint_careful+0xd/0x21

where, according to gdb,

(gdb) l *__up_read+0x8f
0xffffffff80311769 is in __up_read (/home/rafael/src/linux-2.6/lib/rwsem-spinlock.c:273).
268
269 if (--sem->activity == 0 && !list_empty(&sem->wait_list))
270 sem = __rwsem_wake_one_writer(sem);
271
272 spin_unlock_irqrestore(&sem->wait_lock, flags);
273 }
274
275 /*
276 * release a write lock on the semaphore
277 */

What gives?

Rafael

2008-01-31 14:27:54

by Peter Zijlstra

Subject: [PATCH] hrtimer: fix hrtimer_init_sleeper() users


On Thu, 2008-01-31 at 00:42 +0100, Rafael J. Wysocki wrote:
> Update.
>
> On Wednesday, 30 of January 2008, Rafael J. Wysocki wrote:
> > Hi,
> >
> > Recently I've been observing problems with unmounting the /home fs on reboot
> > and/or shutdown on two test boxes.
> >
> > After some more investigation I've found that this is due to some KDE processes
> > stuck in D states after their owner has logged out.
> >
> > This happens 100% of the time if there's a suspend/resume cycle before the user
> > logs out (i.e. the user logs into KDE, works for some time, suspends the box to
> > RAM and resumes one or more times and then logs out). Still, I also observe the
> > symptoms on a box that's never suspended.
> >
> > I'm not sure how to debug this, so please advise.
>
> After reverting:
>
> commit 37bb6cb4097e29ffee970065b74499cbf10603a3
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Jan 25 21:08:32 2008 +0100
>
> hrtimer: unlock hrtimer_wakeup
>
> I no longer get processes in the D state, but there still is a problem with
> artswrapper (this is an openSUSE 10.3 system, x86-64). Namely,
> after a suspend/resume cycle and logging out/logging in the user,
> artswrapper gets stuck somewhere, apparently in the running (R) state.
> For this reason it blocks any subsequent attempts to suspend.
>
> Here's the relevant trace (from show_state()):
>
> [ 522.474919] artswrapper R running task 0 4805 1
> [ 522.474922] ffff810074cd1f70 0000000000000082 0000000000000296 ffff810074cd1ed8
> [ 522.474926] ffffffff80311769 ffff810074cd1f20 ffffffff80701240 ffffffff80701240
> [ 522.474930] ffffffff80701240 ffffffff80701240 ffffffff80701240 ffffffff80701240
> [ 522.474933] Call Trace:
> [ 522.474940] [<ffffffff80311769>] ? __up_read+0x8f/0x97
> [ 522.474963] [<ffffffff8020c5cf>] retint_careful+0xd/0x21
>
> where, according to gdb,
>
> (gdb) l *__up_read+0x8f
> 0xffffffff80311769 is in __up_read (/home/rafael/src/linux-2.6/lib/rwsem-spinlock.c:273).
> 268
> 269 if (--sem->activity == 0 && !list_empty(&sem->wait_list))
> 270 sem = __rwsem_wake_one_writer(sem);
> 271
> 272 spin_unlock_irqrestore(&sem->wait_lock, flags);
> 273 }
> 274
> 275 /*
> 276 * release a write lock on the semaphore
> 277 */
>
> What gives?

Well, let me give you something that worked for Guillaume :-)

---
Subject: hrtimer: fix hrtimer_init_sleeper() users

commit 37bb6cb4097e29ffee970065b74499cbf10603a3
Author: Peter Zijlstra <[email protected]>
Date: Fri Jan 25 21:08:32 2008 +0100

hrtimer: unlock hrtimer_wakeup

Broke hrtimer_init_sleeper() users. It forgot to fix up the futex
caller of this function to detect the failed queueing and messed up
the do_nanosleep() caller in that it could leak a TASK_INTERRUPTIBLE
state.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/futex.c | 2 ++
kernel/hrtimer.c | 2 ++
2 files changed, 4 insertions(+)

Index: linux-2.6/kernel/hrtimer.c
===================================================================
--- linux-2.6.orig/kernel/hrtimer.c
+++ linux-2.6/kernel/hrtimer.c
@@ -1312,6 +1312,8 @@ static int __sched do_nanosleep(struct h

} while (t->task && !signal_pending(current));

+ __set_current_state(TASK_RUNNING);
+
return t->task == NULL;
}

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -1252,6 +1252,8 @@ static int futex_wait(u32 __user *uaddr,
t.timer.expires = *abs_time;

hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&t->timer))
+ t->task = NULL;

/*
* the timer could have already expired, in which

2008-01-31 15:43:40

by Peter Zijlstra

Subject: [PATCH -v2] hrtimer: fix hrtimer_init_sleeper() users

This time build-tested.

---
Subject: hrtimer: fix hrtimer_init_sleeper() users

commit 37bb6cb4097e29ffee970065b74499cbf10603a3
Author: Peter Zijlstra <[email protected]>
Date: Fri Jan 25 21:08:32 2008 +0100

hrtimer: unlock hrtimer_wakeup

Broke hrtimer_init_sleeper() users. It forgot to fix up the futex
caller of this function to detect the failed queueing and messed up
the do_nanosleep() caller in that it could leak a TASK_INTERRUPTIBLE
state.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/futex.c | 2 ++
kernel/hrtimer.c | 2 ++
2 files changed, 4 insertions(+)

Index: linux-2.6/kernel/hrtimer.c
===================================================================
--- linux-2.6.orig/kernel/hrtimer.c
+++ linux-2.6/kernel/hrtimer.c
@@ -1312,6 +1312,8 @@ static int __sched do_nanosleep(struct h

} while (t->task && !signal_pending(current));

+ __set_current_state(TASK_RUNNING);
+
return t->task == NULL;
}

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -1252,6 +1252,8 @@ static int futex_wait(u32 __user *uaddr,
t.timer.expires = *abs_time;

hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&t.timer))
+ t.task = NULL;

/*
* the timer could have already expired, in which
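
The state leak described in the changelog can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the types and the functions model_timer_start() and model_nanosleep() are invented for illustration and are not kernel code; the model assumes that a timer which has already expired is not queued, so the sleeper never reaches schedule() and nothing resets the task state it set beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's task states (illustrative only). */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct task {
    enum task_state state;
};

struct sleeper {
    struct task *task;  /* set to NULL once the sleeper has been "woken" */
    bool queued;        /* models hrtimer_active() after hrtimer_start() */
};

/* Models starting a timer that is not queued if it has already expired. */
static void model_timer_start(struct sleeper *t, bool already_expired)
{
    t->queued = !already_expired;
}

/*
 * Models the sleep loop. If the timer could not be queued, the task never
 * sleeps, so nothing resets the state it set before starting the timer.
 * With apply_fix, the state is reset after the loop, as in the patch.
 */
static bool model_nanosleep(struct task *current, bool already_expired,
                            bool apply_fix)
{
    struct sleeper t = { .task = current, .queued = false };

    assert(current != NULL);

    do {
        current->state = TASK_INTERRUPTIBLE;
        model_timer_start(&t, already_expired);
        if (!t.queued)
            t.task = NULL;              /* detect the failed queueing */

        if (t.task != NULL) {
            /* Stand-in for schedule() followed by the timer wakeup. */
            current->state = TASK_RUNNING;
            t.task = NULL;
        }
    } while (t.task != NULL);

    if (apply_fix)
        current->state = TASK_RUNNING;  /* the reset added by the patch */

    return t.task == NULL;
}
```

In this model, calling model_nanosleep() with already_expired set and apply_fix cleared returns with the task still marked TASK_INTERRUPTIBLE, which corresponds to the leak; with apply_fix set the task is back in TASK_RUNNING on return. The model leaves out signals and real scheduling entirely.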

2008-01-31 19:12:22

by Alessandro Suardi

Subject: Re: [Regression] 2.6.24-git8 (and earlier): Multiple processes stuck in D states after logout from KDE

2008/1/31 Rafael J. Wysocki <[email protected]>:
> Update.
>
> On Wednesday, 30 of January 2008, Rafael J. Wysocki wrote:
> > Hi,
> >
> > Recently I've been observing problems with unmounting the /home fs on reboot
> > and/or shutdown on two test boxes.
> >
> > After some more investigation I've found that this is due to some KDE processes
> > stuck in D states after their owner has logged out.
> >
> > This happens 100% of the time if there's a suspend/resume cycle before the user
> > logs out (i.e. the user logs into KDE, works for some time, suspends the box to
> > RAM and resumes one or more times and then logs out). Still, I also observe the
> > symptoms on a box that's never suspended.
> >
> > I'm not sure how to debug this, so please advise.
>
> After reverting:
>
> commit 37bb6cb4097e29ffee970065b74499cbf10603a3
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Jan 25 21:08:32 2008 +0100
>
> hrtimer: unlock hrtimer_wakeup
>
> I no longer get processes in the D state, but there still is a problem with
> artswrapper (this is an openSUSE 10.3 system, x86-64). Namely,
> after a suspend/resume cycle and logging out/logging in the user,
> artswrapper gets stuck somewhere, apparently in the running (R) state.
> For this reason it blocks any subsequent attempts to suspend.
>
> Here's the relevant trace (from show_state()):
>
> [ 522.474919] artswrapper R running task 0 4805 1
> [ 522.474922] ffff810074cd1f70 0000000000000082 0000000000000296 ffff810074cd1ed8
> [ 522.474926] ffffffff80311769 ffff810074cd1f20 ffffffff80701240 ffffffff80701240
> [ 522.474930] ffffffff80701240 ffffffff80701240 ffffffff80701240 ffffffff80701240
> [ 522.474933] Call Trace:
> [ 522.474940] [<ffffffff80311769>] ? __up_read+0x8f/0x97
> [ 522.474963] [<ffffffff8020c5cf>] retint_careful+0xd/0x21
>
> where, according to gdb,
>
> (gdb) l *__up_read+0x8f
> 0xffffffff80311769 is in __up_read (/home/rafael/src/linux-2.6/lib/rwsem-spinlock.c:273).
> 268
> 269 if (--sem->activity == 0 && !list_empty(&sem->wait_list))
> 270 sem = __rwsem_wake_one_writer(sem);
> 271
> 272 spin_unlock_irqrestore(&sem->wait_lock, flags);
> 273 }
> 274
> 275 /*
> 276 * release a write lock on the semaphore
> 277 */
>
> What gives?

In recent kernels, I had a hard hang in -git6 on my Fedora 8-based
Dell D610 (x86, UP), and since at least -git4 I can't shut down my
Oracle 11.1 instance: the VKTM thread remains in the R state
while taking up no CPU at all:

[root@sandman ~]# ps ax | grep vktm
3211 ? Rs 0:00 ora_vktm_t111

Additionally, VKTM is

- non-straceable (strace hangs)
- non-gdb'able (gdb hangs)
- non-pstack'able (pstack returns empty)
- non-killable (kill -9 doesn't kill it)

echo'ing 't' in /proc/sysrq-trigger, I have (2.6.24-git8):

Jan 31 20:04:31 sandman kernel: =======================
Jan 31 20:04:31 sandman kernel: oracle R running 2668 3211 1
Jan 31 20:04:31 sandman kernel: f1462fb0 00000082 f1f1aa54 f172dd80 c01055f6 00000000 0ef609ac bf8532ec
Jan 31 20:04:31 sandman kernel: 00000000 f1462000 c0103e26 0ef609ac bf853388 b79906a4 bf8532ec 00000000
Jan 31 20:04:31 sandman kernel: bf850140 bf850068 0000007b 0000007b c0320000 ffffffff 08f53d02 00000073
Jan 31 20:04:31 sandman kernel: Call Trace:
Jan 31 20:04:31 sandman kernel: [<c01055f6>] ? do_IRQ+0xac/0xc1
Jan 31 20:04:31 sandman kernel: [<c0103e26>] work_resched+0x5/0x16
Jan 31 20:04:31 sandman kernel: [<c0320000>] ? tg3_init_one+0xe76/0x10e0
Jan 31 20:04:31 sandman kernel: =======================

Once I'm done with a largish FTP transfer I'll try to find out exactly
where this began.

--alessandro

"We act as though comfort and luxury were the chief requirements
of life, when all that we need to make us really happy is
something to be enthusiastic about."

(Charles Kingsley)