2005-02-21 19:28:42

by Anthony DiSante

Subject: uninterruptible sleep lockups

Processes that get permanently stuck in "uninterruptible sleep" (the D state
as indicated by "ps aux") are such a pain. Of course they've always
existed, but at least on the 3 systems that I administer, they are far more
frequent with udev than they ever were before. I'm constantly upgrading
udev, hal, etc on these 3 different systems, but still not a week goes by
that one of them doesn't need a reboot because some hardware-related process
is hung.

The most recent one was yesterday: I had run lsusb in the morning and had no
problems, but at the end of the day I ran it again, and after outputting 3
lines of data, it hung, stuck in D-state. So now I have this:

[/home/user]$ ps aux|grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 92 0.0 0.0 0 0 ? D Feb19 0:00 [khubd]
root 845 0.0 0.0 0 0 ? D Feb19 0:00 [knodemgrd_0]
root 29016 0.0 0.1 1512 592 ? D 00:28 0:00 lsusb
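
Note that "grep D" matches any line containing a capital D, not only the STAT column. A minimal Python sketch of a more targeted check follows, assuming a Linux /proc filesystem where each /proc/<pid>/stat line begins with "pid (comm) state". The function names are made up for illustration.

```python
import os


def parse_stat_line(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for tasks in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A task that shows up here only briefly is usually just doing I/O; the ones worth reporting are those that stay in the list across repeated runs.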

It seems like this problem is always going to exist, because some hardware
and some drivers will always be buggy. So shouldn't we have some sort of
watchdog higher up in the kernel, that watches for hung processes like this
and kills them?

Don't get me wrong, I love rebooting every couple days... but I have a
Windows system for that.

-Anthony DiSante
http://nodivisions.com/


2005-02-21 19:46:37

by Valdis Klētnieks

Subject: Re: uninterruptible sleep lockups

On Mon, 21 Feb 2005 14:18:44 EST, Anthony DiSante said:

> It seems like this problem is always going to exist, because some hardware
> and some drivers will always be buggy. So shouldn't we have some sort of
> watchdog higher up in the kernel, that watches for hung processes like this
> and kills them?

And said watchdog would clean up the mess, how, exactly? There's lots of sticky
issues having to do with breaking locks and possibly still-pending I/O (I once had
a tape drive complete an I/O 3 *days* after the request was sent - good thing no
watchdog killed the process and deallocated the memory that I/O landed in ;)

It's been covered before, look in the lkml archives for details.



2005-02-21 20:26:36

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

[email protected] wrote:
>>It seems like this problem is always going to exist, because some hardware
>>and some drivers will always be buggy. So shouldn't we have some sort of
>>watchdog higher up in the kernel, that watches for hung processes like this
>>and kills them?
>
> And said watchdog would clean up the mess, how, exactly? There's lots of sticky
> issues having to do with breaking locks and possibly still-pending I/O (I once had
> a tape drive complete an I/O 3 *days* after the request was sent - good thing no
> watchdog killed the process and deallocated the memory that I/O landed in ;)

I'm not a kernel programmer, so I don't have the answers to any of that. I
guess I was thinking that there'd be some way to distinguish between
processes that are truly stuck -- that is, never coming back -- and
processes like yours, that are taking a long time but still working.

Or maybe it SHOULD have killed your process, in some "proper" way that
prevents any outstanding I/O requests from coming in days later and breaking
things. Again, I'm no kernel hacker, but if an I/O request takes *3 days*,
isn't that an indication of a bug or of faulty hardware perhaps?

> It's been covered before, look in the lkml archives for details.

Thanks, I'll do that. But could you give me a more specific pointer?
Searching lkml for "uninterruptible" returns ~2000 results.

Thanks,
Anthony DiSante
http://nodivisions.com/

2005-02-21 20:54:14

by Valdis Klētnieks

Subject: Re: uninterruptible sleep lockups

On Mon, 21 Feb 2005 15:24:21 EST, Anthony DiSante said:
> Or maybe it SHOULD have killed your process, in some "proper" way that
> prevents any outstanding I/O requests from coming in days later and breaking
> things. Again, I'm no kernel hacker, but if an I/O request takes *3 days*,
> isn't that an indication of a bug or of faulty hardware perhaps?

Right. And if you're an automated program trying to clean up after a *bug*,
what do you do? It's quite likely that somebody's borked a lock - in which
case it may even be that the "hung" process is the victim rather than the
culprit, and breaking the lock will just make things worse. Similar issues
apply to *all* of the resources the wedged process has attached to it.

When these things get posted on lkml, it almost always involves quite a bit
of code introspection and scratching of heads before we figure out how the
system *got* its figurative head wedged into that crevice. Until you
figure out how it *got* there, a safe cleanup is in general impossible. And
we haven't yet seen the automatic program that can introspect the code to that
detail - even though the Stanford automated checker, sparse, and the like are
quite impressive pieces of work.

> > It's been covered before, look in the lkml archives for details.
>
> Thanks, I'll do that. But could you give me a more specific pointer?

See the thread rooted here:

Date: Wed, 03 Nov 2004 07:51:39 -0500
From: Gene Heskett <[email protected]>
Subject: is killing zombies possible w/o a reboot?
Sender: [email protected]
To: [email protected]
Reply-to: [email protected]
Message-id: <[email protected]>



2005-02-21 22:20:21

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

[email protected] wrote:
> See the thread rooted here:
>
> Date: Wed, 03 Nov 2004 07:51:39 -0500
> From: Gene Heskett <[email protected]>
> Subject: is killing zombies possible w/o a reboot?
> Sender: [email protected]
> To: [email protected]
> Reply-to: [email protected]
> Message-id: <[email protected]>

OK, there are two different opinions expressed at various places in that
thread: they are that automatically killing processes hung in D state would
be either 1) difficult/nonideal, or 2) impossible.

If it's truly impossible, then that settles it.

But if it's just difficult/nonideal, then here are my thoughts. Again
referencing that thread, there are a bunch of comments saying "that's an NFS
bug, fix the bug" and "that's a samba bug, fix the bug" and "that's a driver
bug, fix the driver."

It's indisputable that there will always be driver bugs and faulty hardware.
Of course these should be fixed, but if it's possible for the kernel to
gracefully deal with the bugs until they get fixed, then why shouldn't it do
so? I understand the goal of making the common (non-buggy) case fast, but
in my experience (and I can't be the only one) buggy hardware/drivers are
becoming more and more common, and with the computer industry getting
ever-bigger and people doing ever-more with their computers, this trend will
only continue (the more hardware on the market the more bugs there will be).

As I stated in my original post, on the 3 different systems I administer, I
need to reboot ~weekly because of the permanent D state. These 3 systems
are completely different, and the processes that hang are different --
digital camera software/drivers, a CDROM, and a printer are among the
sources that have recently caused the permanent D state. Maybe the
non-buggy case is the most common one, but the buggy case is certainly not
UNcommon. If it's possible to wipe out this whole class of problem with
some (admittedly difficult) extra work in the kernel, then I don't see how
that isn't preferable to guaranteeing that people will always need to reboot
their linux systems when they get new hardware that puts processes into the
D state permanently.

-Anthony DiSante
http://nodivisions.com/

2005-02-21 22:43:58

by Chris Friesen

Subject: Re: uninterruptible sleep lockups

Anthony DiSante wrote:

> It's indisputable that there will always be driver bugs and faulty
> hardware. Of course these should be fixed, but if it's possible for the
> kernel to gracefully deal with the bugs until they get fixed, then why
> shouldn't it do so?

Think of the overhead required to track every single resource ever
acquired by the process/thread/entity in question. Then if/when it
hangs, you'd have to properly clean up every last one of them.

Much safer/simpler to leave it hung, and force an eventual reboot.

If you have been given code that causes D states, bitch to the supplier
until they fix it. Kernel bugs are not acceptable.

Chris

2005-02-21 22:46:37

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

[email protected] wrote:
> See the thread rooted here:
>
> Date: Wed, 03 Nov 2004 07:51:39 -0500
> From: Gene Heskett <[email protected]>
> Subject: is killing zombies possible w/o a reboot?
> Sender: [email protected]
> To: [email protected]
> Reply-to: [email protected]
> Message-id: <[email protected]>

Also, one of the things mentioned in that thread is that whenever a driver
is waiting on I/O from a piece of hardware, there should always be some
timeout code. Is that the root of the permanent D state? Is it always a
process waiting on a piece of hardware that should be eventually timing out,
except the timeout code isn't there?

-Anthony DiSante
http://nodivisions.com/

2005-02-21 23:11:39

by Nish Aravamudan

Subject: Re: uninterruptible sleep lockups

On Mon, 21 Feb 2005 17:44:32 -0500, Anthony DiSante
<[email protected]> wrote:
> [email protected] wrote:
> > See the thread rooted here:
> >
> > Date: Wed, 03 Nov 2004 07:51:39 -0500
> > From: Gene Heskett <[email protected]>
> > Subject: is killing zombies possible w/o a reboot?
> > Sender: [email protected]
> > To: [email protected]
> > Reply-to: [email protected]
> > Message-id: <[email protected]>
>
> Also, one of the things mentioned in that thread is that whenever a driver
> is waiting on I/O from a piece of hardware, there should always be some
> timeout code. Is that the root of the permanent D state? Is it always a
> process waiting on a piece of hardware that should be eventually timing out,
> except the timeout code isn't there?

If you would like to file a bugzilla bug (or reference one if you
already have) -- http://bugzilla.kernel.org -- it would be easier to
track the problems. It would be good to get some idea of what hardware
is running (and thus what drivers) to debug further.

Thanks,
Nish

2005-02-22 00:08:31

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

Chris Friesen wrote:
>> It's indisputable that there will always be driver bugs and faulty
>> hardware. Of course these should be fixed, but if it's possible for
>> the kernel to gracefully deal with the bugs until they get fixed, then
>> why shouldn't it do so?
>
> Think of the overhead required to track every single resource ever
> acquired by the process/thread/entity in question. Then if/when it
> hangs, you'd have to properly clean up every last one of them.

Yes, that would be difficult and expensive. But if permanently-D-stated
processes happened monthly on 50% of systems, then wouldn't it be worth it?
How about weekly on 10% of systems? The point is that at some point this
becomes worth considering, and with more people adding more new hardware to
their systems every day, this problem is becoming more and more frequent.

> Much safer/simpler to leave it hung, and force an eventual reboot.

"Eventual" makes it sound far away, but the reality is that if part of your
USB subsystem is D-stated, then "eventually" means next time you want to use
your USB stick, or your printer, or your digital camera, or your MP3 device,
or... In other words, "eventually" means "right now, interrupting all your
current work."

> If you have been given code that causes D states, bitch to the supplier
> until they fix it.

The driver code for my devices has "been given" to me as part of the kernel.
Any of a handful of USB devices has caused permanent D states, as has a
CDROM and a NIC. I guess I'll need to start debugging all of these drivers.
When something goes into permanent D sleep, what should I do to start
tracking down the problem? Aside from obvious stuff like dmesg and checking
/var/log/messages, neither of which ever seems to say anything useful when
this happens.

> Kernel bugs are not acceptable.

That's a nice-sounding ideal, but the truth is that kernel bugs exist and
are not uncommon.

-Anthony DiSante
http://nodivisions.com/

2005-02-22 00:36:28

by Valdis Klētnieks

Subject: Re: uninterruptible sleep lockups

On Mon, 21 Feb 2005 19:06:23 EST, Anthony DiSante said:

> The driver code for my devices has "been given" to me as part of the kernel.
> Any of a handful of USB devices has caused permanent D states, as has a
> CDROM and a NIC. I guess I'll need to start debugging all of these drivers.
> When something goes into permanent D sleep, what should I do to start
> tracking down the problem? Aside from obvious stuff like dmesg and checking
> /var/log/messages, neither of which ever seems to say anything useful when
> this happens.

Alt-Sysrq-T and provide the tracebacks for the wedged process(es). That,
and the other info suggested in the linux/REPORTING-BUGS file will go a long
way to actually getting things fixed.
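
For systems without a console keyboard at hand, the same information can often be gathered through /proc. The Python sketch below assumes a kernel that exposes /proc/<pid>/wchan and /proc/sysrq-trigger (both depend on kernel version and configuration); the helper names are hypothetical.

```python
import os


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel symbol a task is sleeping in, or None if unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            name = f.read().strip()
    except OSError:
        return None
    return name or None


def request_task_dump(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log all task backtraces, like Alt-SysRq-T. Needs root."""
    with open(trigger, "w") as f:
        f.write("t")


if __name__ == "__main__":
    import sys
    # Usage: script.py PID [PID ...] prints where each task is sleeping.
    for arg in sys.argv[1:]:
        print(arg, read_wchan(int(arg)))
```

After request_task_dump() the tracebacks land in the kernel log, so dmesg output is what would go into the bug report.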

> > Kernel bugs are not acceptable.
>
> That's a nice-sounding ideal, but the truth is that kernel bugs exist and
> are not uncommon.

Yes, but how do you write a "unwedge the hung process" daemon, given that said
daemon needs to know what the bug the process hit was in order to properly
unwedge it (at which point it's easier just to *fix* the frikking bug), and
also given that said unwedger will itself have bugs.

If you need further convincing, look at the rock-solid OOM-killer code, which has
a lot of the same issues as a zombie-unwedger - and all *it* has to do is deliver
a 'kill -9' to the right process. It doesn't have to unsnarl memory allocations
and locks and semaphores and PCI resources and all the rest....



2005-02-22 11:18:58

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

Helge Hafting wrote:
> The infrastructure for that does not exist, so instead, the "killed"
> process remains. Not all of it, but at least the memory pinned down by
> the io request. This overhead is typically small, and the overhead of
> adding forced io abort to every driver might
> be larger than a handful of stuck processes. It looks ugly, but perhaps
> a ps flag that hides the ugly processes is enough.

I don't care about any overhead associated with stuck processes, nor do I
care that they look ugly in the ps output. What I care about is the fact
that at least once a week on multiple systems with different hardware, some
HW-related driver/process gets stuck, then immediately cascades its
stuckness up to udevd or hald, and then I can't use any of my hardware
anymore until I reboot.

-Anthony DiSante
http://nodivisions.com/

2005-02-22 12:28:45

by Denis Vlasenko

Subject: Re: uninterruptible sleep lockups

On Tuesday 22 February 2005 13:16, Anthony DiSante wrote:
> Helge Hafting wrote:
> > The infrastructure for that does not exist, so instead, the "killed"
> > process remains. Not all of it, but at least the memory pinned down by
> > the io request. This overhead is typically small, and the overhead of
> > adding forced io abort to every driver might
> > be larger than a handful of stuck processes. It looks ugly, but perhaps
> > a ps flag that hides the ugly processes is enough.
>
> I don't care about any overhead associated with stuck processes, nor do I
> care that they look ugly in the ps output. What I care about is the fact
> that at least once a week on multiple systems with different hardware, some
> HW-related driver/process gets stuck, then immediately cascades its
> stuckness up to udevd or hald, and then I can't use any of my hardware
> anymore until I reboot.

This was discussed to death before. There will never be a "D-state" killer. Period.

If you want to get rid of your stuck processes, you need to fix the bug
or at least let lkml people know about it (this was already explained to you!).
--
vda

2005-02-22 12:38:17

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

Denis Vlasenko wrote:
>>>The infrastructure for that does not exist, so instead, the "killed"
>>>process remains. Not all of it, but at least the memory pinned down by
>>>the io request. This overhead is typically small, and the overhead of
>>>adding forced io abort to every driver might
>>>be larger than a handful of stuck processes. It looks ugly, but perhaps
>>>a ps flag that hides the ugly processes is enough.
>>
>>I don't care about any overhead associated with stuck processes, nor do I
>>care that they look ugly in the ps output. What I care about is the fact
>>that at least once a week on multiple systems with different hardware, some
>>HW-related driver/process gets stuck, then immediately cascades its
>>stuckness up to udevd or hald, and then I can't use any of my hardware
>>anymore until I reboot.
>
>
> This was discussed to death before. There will never be a "D-state" killer. Period.
>
> If you want to get rid of your stuck processes, you need to fix the bug
> or at least let lkml people know about it (this was already explained to you!).

I didn't mention any of that here; my reply was simply to correct Helge's
misunderstanding about why I dislike stuck processes. Regardless of whether
the bugs get fixed or the kernel finds a way to work around them, my dislike
has nothing to do with the overhead or "ugliness" of stuck processes; I
dislike them because they render my system useless for 75% of the things I
use it for.

-Anthony DiSante
http://nodivisions.com/

2005-02-22 13:49:02

by linux-os (Dick Johnson)

Subject: Re: uninterruptible sleep lockups

On Tue, 22 Feb 2005, Anthony DiSante wrote:

> Helge Hafting wrote:
>> The infrastructure for that does not exist, so instead, the "killed"
>> process remains. Not all of it, but at least the memory pinned down by the
>> io request. This overhead is typically small, and the overhead of adding
>> forced io abort to every driver might
>> be larger than a handful of stuck processes. It looks ugly, but perhaps a
>> ps flag that hides the ugly processes is enough.
>
> I don't care about any overhead associated with stuck processes, nor do I
> care that they look ugly in the ps output. What I care about is the fact
> that at least once a week on multiple systems with different hardware, some
> HW-related driver/process gets stuck, then immediately cascades its stuckness
> up to udevd or hald, and then I can't use any of my hardware anymore until I
> reboot.
>
> -Anthony DiSante
> http://nodivisions.com/

You don't seem to understand. A process that's stuck in 'D' state
shows a SEVERE error, usually with a hardware driver. For instance,
somebody may have coded something in a critical section that will
wait forever for some bit to be set when, in fact, that bit may
never be set because of a hardware glitch. Such problems must
be found. One can't just suck some process out of the 'D' state.

So, you need to tell what driver was doing what. If you can't
then you need to provide enough information so that developers
may guess. For instance, if you get a process stuck in the 'D'
state when you use a CD/ROM, but not otherwise when you use
IDE or SCSI or whatever.., then you have a good guess that
there is some "wait-forever" code in the CD/ROM driver.

So, let's suppose that you had a problem with your CD/ROM.
You could eject it by hand and see if the process that
was stuck is no longer stuck, or you might be able to
power it OFF then ON. If this got a process "unstuck"
it might give the CD/ROM driver developer a hint as
to where to look in his code. No code is ever supposed
to wait forever for some hardware, but there are some
possibilities (races and whatever), that can effectively
wait forever. These possibilities need to be discovered
and fixed.
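
The difference between "wait forever" and a bounded wait can be sketched in a few lines of Python. This is a userspace illustration only, not driver code; read_status and mask are stand-ins for whatever register and bit a real driver would poll.

```python
import time


def wait_for_bit(read_status, mask, timeout_s, poll_interval_s=0.001):
    """Poll read_status() until (value & mask) is set or the timeout expires.

    Returns True if the bit was seen, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_status() & mask:
            return True
        if time.monotonic() >= deadline:
            return False  # give up so the caller can report an error
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # A "device" whose ready bit never appears: the call returns False
    # after about 50 ms instead of hanging.
    print(wait_for_bit(lambda: 0x0, 0x1, timeout_s=0.05))
```

The caller still has to decide what to do on False (reset the device, fail the request), which is where the real driver work lies.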

The 'D' state usually stands for 'Down' where a task
was 'down()' on a semaphore. To get out of that state,
that task (and none other) needs to execute 'up()'.
This means that whatever that task was waiting for
needs to happen or it won't call 'up()'. The nature
of these mutexes requires that the thread that
acquired the semaphore be the same thread that
released it, otherwise we don't have a MUTEX.
So, there is no way that "somebody else" can
"fix" the task thread waiting with the MUTEX held.

There has been some discussion that these hung
states could be "fixed", but that's absolutely
positively incorrect. If you have a MUTEX that
"times out" or is otherwise breakable, you can't
use it to provide a single execution path to
a shared resource which is what these things
are used for in the first place.
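
A rough userspace analogy in Python, using threading.Lock rather than kernel semaphores: a waiter may give up after a timeout, but that does nothing to the holder, and the lock stays taken. The "hang" here is simulated with an event that the demo eventually sets so the script can exit; a real wedged driver offers no such switch.

```python
import threading


def demo():
    """Return (waiter_got_lock, lock_still_held) while the holder is stuck."""
    resource_lock = threading.Lock()
    hardware_done = threading.Event()   # stands in for a bit that never arrives
    holder_has_lock = threading.Event()

    def stuck_holder():
        with resource_lock:
            holder_has_lock.set()
            hardware_done.wait()        # simulated wait-forever inside the lock

    holder = threading.Thread(target=stuck_holder)
    holder.start()
    holder_has_lock.wait()

    # The waiter can time out, but timing out does not free the resource.
    got_it = resource_lock.acquire(timeout=0.05)
    still_held = resource_lock.locked()

    # Only for the demo: let the holder finish so the thread can be joined.
    hardware_done.set()
    holder.join()
    return got_it, still_held


if __name__ == "__main__":
    print(demo())
```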

Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-02-22 20:05:44

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

linux-os wrote:
> There has been some discussion that these hung
> states could be "fixed", but that's absolutely
> positively incorrect.

That's one of the things I asked a few messages ago. Some people on the
list were saying that it'd be "really hard" and would "require a lot of
bookkeeping" to "fix" permanently-D-stated processes... which is completely
different than "impossible."

-Anthony DiSante
http://nodivisions.com/

2005-02-22 20:16:18

by Chris Friesen

Subject: Re: uninterruptible sleep lockups

Anthony DiSante wrote:
> linux-os wrote:
>
>> There has been some discussion that these hung
>> states could be "fixed", but that's absolutely
>> positively incorrect.
>
>
> That's one of the things I asked a few messages ago. Some people on the
> list were saying that it'd be "really hard" and would "require a lot of
> bookkeeping" to "fix" permanently-D-stated processes... which is
> completely different than "impossible."

Nothing is "impossible". Cracking SHA-256 isn't "impossible", it just
takes more computing power than exists on the face of the planet.

Call it "infeasible" if you like. It's theoretically possible, but the
amount of work and the overhead involved just are not realistic. And
then you have the likelihood of a bug in the bookkeeping code leading to
runtime corruption... Better to take the hit now and fix the original
problem.

Chris

2005-02-22 20:25:14

by Horst H. von Brand

Subject: Re: uninterruptible sleep lockups

Anthony DiSante <[email protected]> said:
> linux-os wrote:
> > There has been some discussion that these hung
> > states could be "fixed", but that's absolutely
> > positively incorrect.

> That's one of the things I asked a few messages ago. Some people on the
> list were saying that it'd be "really hard" and would "require a lot of
> bookkeeping" to "fix" permanently-D-stated processes... which is completely
> different than "impossible."

Most people here have little clue. It can't be done.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-02-22 20:32:24

by Anthony DiSante

Subject: Re: uninterruptible sleep lockups

Chris Friesen wrote:
>>> There has been some discussion that these hung
>>> states could be "fixed", but that's absolutely
>>> positively incorrect.
>>
>> That's one of the things I asked a few messages ago. Some people on
>> the list were saying that it'd be "really hard" and would "require a
>> lot of bookkeeping" to "fix" permanently-D-stated processes... which
>> is completely different than "impossible."
>
>
> Nothing is "impossible".

Maybe where you live, but in my world some things are most certainly
impossible. Getting a 1MHz CPU to run at 1THz is impossible. Having the
kernel automatically install the latest firmware for a buggy device, without
actually having the new firmware file, is impossible. Turning your 17" LCD
monitor into a 50" HDTV is impossible.

> Cracking SHA-256 isn't "impossible", it just
> takes more computing power than exists on the face of the planet.

Thanks for proving my point. That's a perfect example of the difference
between "hard" and "impossible."

> Call it "infeasible" if you like. It's theoretically possible, but the
> amount of work and the overhead involved just are not realistic.

Again, that was one of my earlier questions, since some people here were
saying "impossible" while others were saying "really hard."

-Anthony DiSante
http://nodivisions.com/

2005-02-22 20:56:55

by Chris Friesen

Subject: Re: uninterruptible sleep lockups

Horst von Brand wrote:
> Anthony DiSante <[email protected]> said:

>>That's one of the things I asked a few messages ago. Some people on the
>>list were saying that it'd be "really hard" and would "require a lot of
>>bookkeeping" to "fix" permanently-D-stated processes... which is completely
>>different than "impossible."
>
> Most people here have little clue. It can't be done.

I realize it would be extremely difficult if not impossible to do in the
current linux architecture, but I find it hard to believe that it is
technically impossible if one were allowed to design the system from
scratch.

Maybe I'm on crack, but would it not be technically possible to have all
resource usage be tracked so that when a task tries to do something and
hangs, eventually it gets cleaned up?

We already handle cleaning up stuff for userspace (memory, file
descriptors, sockets, etc.). Why not enforce a design that says "all
entities taking a lock must specify a maximum hold time". After that
time expires, they are assumed to be hung, and all their resources
(which were being tracked by some system) get cleaned up.

It would probably be complicated, slow, and generally not worth the
effort. But it seems at least technically possible.

Chris
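
The bookkeeping half of that idea can be sketched in Python as below. It only detects overdue holders; it does not clean anything up, which is the hard part discussed in this thread. HoldTracker and its method names are hypothetical, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class HoldTracker:
    """Record who holds which lock and when each hold is considered overdue."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadlines = {}  # (owner, lock_name) -> deadline timestamp

    def taken(self, owner, lock_name, max_hold_s):
        """Call when owner takes lock_name, declaring its maximum hold time."""
        self._deadlines[(owner, lock_name)] = self._clock() + max_hold_s

    def released(self, owner, lock_name):
        """Call when owner releases lock_name."""
        self._deadlines.pop((owner, lock_name), None)

    def overdue(self):
        """Return the sorted (owner, lock_name) pairs past their deadline."""
        now = self._clock()
        return sorted(key for key, deadline in self._deadlines.items()
                      if deadline <= now)


if __name__ == "__main__":
    fake_now = [0.0]
    tracker = HoldTracker(clock=lambda: fake_now[0])
    tracker.taken("lsusb", "usb_bus", max_hold_s=5.0)
    fake_now[0] = 10.0
    print(tracker.overdue())
```

Even this toy version adds a dictionary update to every lock and unlock, which hints at the overhead concern raised earlier.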

2005-02-22 21:42:18

by linux-os (Dick Johnson)

Subject: Re: uninterruptible sleep lockups

On Tue, 22 Feb 2005, Chris Friesen wrote:

> Horst von Brand wrote:
>> Anthony DiSante <[email protected]> said:
>
>>> That's one of the things I asked a few messages ago. Some people on the
>>> list were saying that it'd be "really hard" and would "require a lot of
>>> bookkeeping" to "fix" permanently-D-stated processes... which is
>>> completely different than "impossible."
>>
>> Most people here have little clue. It can't be done.
>
> I realize it would be extremely difficult if not impossible to do in the
> current linux architecture, but I find it hard to believe that it is
> technically impossible if one were allowed to design the system from scratch.
>

No. It has nothing to do with the architecture. These problems go
all the way back to the first multi-tasking systems.

> Maybe I'm on crack, but would it not be technically possible to have all
> resource usage be tracked so that when a task tries to do something and
> hangs, eventually it gets cleaned up?

It's not the "task" that's hung. That's just the symptoms. It's
the SHARED RESOURCE that is hung! Once some task attempts
to use a shared resource, it must (somehow) get in-line so
that it can use that resource without somebody else coming
along and mucking with it. To get "in-line" means to
execute some kind of MUTEX. There are many kinds. VAXen
had a "lock manager", there are simple sleeping-loops using
semaphores, etc. Linux uses such loops, the two most
commonly used are "down()" and "up()".

Now, somebody needs a resource. It executes down(&semaphore);
once it gets control again, it has that resource. It attempts
to use that resource through a driver. The driver waits forever.
The resource is now permanently dorked --forever because its
driver is waiting forever. The user code never returns from
the driver so it can never execute up(&semaphore).

If you, somehow, grab hold of the program-counter (like
a long-jump), and force a return so that up(&semaphore) gets
executed, the wrong thread unlocks the semaphore but it's
forever broken anyway because the resource it protects is
hung.

>
> We already handle cleaning up stuff for userspace (memory, file descriptors,
> sockets, etc.). Why not enforce a design that says "all entities taking a
> lock must specify a maximum hold time". After that time expires, they are
> assumed to be hung, and all their resources (which were being tracked by some
> system) get cleaned up.
>

Time won't do it. It's not a matter of "cleaning up" it's a matter
of not waiting forever in the first place. If you were able to
"clean up" by reinitializing the semaphores, etc., killing anything
that was attached, etc., the next instance of attempting to use
that resource will hang the exact same way.

We are not talking about some broken semaphore code that sometimes
gets hung. We are talking about the resource it protects. The
semaphore code is fine.

> It would probably be complicated, slow, and generally not worth the effort.
> But it seems at least technically possible.
>
> Chris

The problem is not "Waiting in D state". That's the symptom.
The problem is waiting forever after a lock has been taken.
That is the problem.


Cheers,
Dick Johnson

2005-02-22 23:19:16

by Chris Friesen

Subject: Re: uninterruptible sleep lockups

linux-os wrote:

> Now, somebody needs a resource. It executes down(&semaphore);
> once it gets control again, it has that resource. It attempts
> to use that resource through a driver. The driver waits forever.
> The resource is now permanently dorked --forever because its
> driver is waiting forever. The user code never returns from
> the driver so it can never execute up(&semaphore).

What about something like a "robust mutex" (in OSDL terminology)? The
guy holding it too long gets killed, and the mutex gets marked as dirty.
The next guy to acquire the mutex is responsible for re-initializing
the resource (resetting the device to a known state, for instance).

Chris
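
A toy Python version of that scheme, to make the moving parts concrete. It is a thought-experiment sketch, not the OSDL design: RobustLock, break_lock and Device are invented names, and "killing" the holder is reduced to a watchdog releasing the lock on its behalf.

```python
import threading


class RobustLock:
    """Toy lock that a watchdog can break and that remembers it was broken."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dirty = False

    def acquire(self):
        """Take the lock; return True if the previous holder was evicted."""
        self._lock.acquire()
        was_dirty, self._dirty = self._dirty, False
        return was_dirty

    def release(self):
        self._lock.release()

    def break_lock(self):
        """Watchdog path: evict the current holder and mark the lock dirty."""
        self._dirty = True
        self._lock.release()


class Device:
    """Stand-in for a piece of hardware guarded by a RobustLock."""

    def __init__(self):
        self.lock = RobustLock()
        self.resets = 0

    def use(self):
        if self.lock.acquire():
            self.resets += 1  # previous holder was evicted: reset to known state
        try:
            return "ok"
        finally:
            self.lock.release()


if __name__ == "__main__":
    dev = Device()
    dev.lock.acquire()     # a holder takes the lock and then hangs
    dev.lock.break_lock()  # the watchdog evicts it
    print(dev.use(), dev.resets)
```

The sketch glosses over what happens to the evicted holder's own code path and any I/O it still has in flight.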

2005-02-22 23:43:44

by linux-os (Dick Johnson)

Subject: Re: uninterruptible sleep lockups

On Tue, 22 Feb 2005, Chris Friesen wrote:

> linux-os wrote:
>
>> Now, somebody needs a resource. It executes down(&semaphore);
>> once it gets control again, it has that resource. It attempts
>> to use that resource through a driver. The driver waits forever.
>> The resource is now permanently dorked --forever because its
>> driver is waiting forever. The user code never returns from
>> the driver so it can never execute up(&semaphore).
>
> What about something like a "robust mutex" (in OSDL terminology)? The guy
> holding it too long gets killed, and the mutex gets marked as dirty. The
> next guy to acquire the mutex is responsible for re-initializing the resource
> (resetting the device to a known state, for instance).
>
> Chris
>

All wonderful. However, it doesn't fix the problem. You are,
again, assuming that the problem is the symptom! The problem
is that some piece of code is not handling an exception
properly. It is waiting forever for something that will
never happen. It's that CODE that needs to be fixed.

"Cleaning" up the immediate symptoms doesn't let
the next thread that acquires the "cleaned up" lock
use the hardware because it has jammed code between
that thread and the hardware.

The bad code needs to be fixed. If the bad code is
fixed, you will __never__ have a process stuck
in 'D' state unless you run for the 1000 years
that could statistically result in a bit in
the semaphore getting flipped.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-02-23 00:25:37

by Chris Friesen

Subject: Re: uninterruptible sleep lockups

linux-os wrote:

Before I get into the reply, I just want to make it clear that I'm not
arguing that we *should* do any of this, just that it is not technically
impossible. It's a thought experiment, not a design suggestion.

> All wonderful. However, it doesn't fix the problem. You are,
> again, assuming that the problem is the symptom! The problem
> is that some piece of code is not handling an exception
> properly. It is waiting forever for something that will
> never happen. It's that CODE that needs to be fixed.

Absolutely. I'm just theorizing that it is possible to devise a system
that would be able to deal with such a situation, analogous to the way
the kernel can deal with bugs in userspace processes (segfaults, traps,
etc.).

> "Cleaning" up the immediate symptoms doesn't let
> the next thread that acquires the "cleaned up" lock
> use the hardware because it has jammed code between
> that thread and the hardware.

If the system is designed such that all resources are tracked, then you
could clean them up when the "hung" entity is killed (the way we do it
for userspace resources). In this case there is no more jammed code.
The next guy to acquire the mutex knows the hardware is in an
undetermined state, and is responsible for reinitializing it to a known
state. This would be horribly complicated, but I don't think it would
be impossible.
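
A toy userspace sketch of that idea in Python, not kernel code. The names
RobustMutex, OwnerDied, Device and use_device are hypothetical and only
illustrate the "mark dirty, next holder reinitializes" protocol; a real
kernel would also have to track and release everything else the evicted
holder owned.

```python
import threading


class OwnerDied(Exception):
    """Raised to the next acquirer when the previous holder was evicted."""


class RobustMutex:
    """Toy model: a lock that remembers its previous holder was killed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dirty = False

    def acquire(self):
        self._lock.acquire()
        if self._dirty:
            # The caller holds the lock now, but must repair the resource.
            raise OwnerDied("previous holder was killed; reset the resource")

    def mark_consistent(self):
        # Called by the new holder once it has reset the resource.
        self._dirty = False

    def release(self):
        self._lock.release()

    def evict_holder(self):
        # Stand-in for "the guy holding it too long gets killed".
        self._dirty = True
        self._lock.release()


class Device:
    """Hypothetical device whose state may be left undetermined."""

    def __init__(self):
        self.state = "ready"

    def reset(self):
        self.state = "ready"


def use_device(mutex, dev):
    try:
        mutex.acquire()
    except OwnerDied:
        # Lock is held; the hardware is in an unknown state, so reset it.
        dev.reset()
        mutex.mark_consistent()
    try:
        return dev.state
    finally:
        mutex.release()
```

Note that this only covers the lock bookkeeping. It does nothing about the
code that hung in the first place.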

> The bad code needs to be fixed. If the bad code is
> fixed, you will __never__ have a process stuck
> in 'D' state unless you run for the 1000 years
> that could statistically result in a bit in
> the semaphore getting flipped.

I don't disagree with you on this. I think that fixing the bad code is
absolutely the way to go. I'm simply indulging in a thought experiment
as to whether or not it is theoretically possible to create a system
that would be able to clean up after this sort of thing once it has
happened.

Chris

2005-02-23 00:58:59

by Bodo Eggert

Subject: Re: uninterruptible sleep lockups

linux-os <[email protected]> wrote:

> You don't seem to understand. A process that's stuck in 'D' state
> shows a SEVERE error, usually with a hardware driver.

Or a network filesystem mount to a no longer existing server or share.

> For instance,
> somebody may have coded something in a critical section that will
> wait forever for some bit to be set when, in fact, that bit may
> never be set because of a hardware glitch. Such problems must
> be found. One can't just suck some process out of the 'D' state.

But you can easily fall into one, e.g. by mounting a SMB share to ~/mnt,
working until after the windows box breaks down and trying to save the
work of the last hour (which involves enumerating and stat()ing all
entries in ~).

> The 'D' state usually stands for 'Down' where a task
> was 'down()' on a semaphore. To get out of that state,
> that task (and none other) needs to execute `up()`.
> This means that whatever that task was waiting for
> needs to happen or it won't call 'up()'.

Maybe the device/mountpoint causing the processes to hang can be declared
dead (This is the more important part to me) and/or the syscall can be
forced to fail. If it involves wasting some MB of RAM for copying all
possibly affected memory in order to avoid corrupting used RAM, that
will be the price to pay for not losing your data.

How to clean up the stuck processes: (This requires a MMU)
Add an error path to each syscall (or create some generic error paths) and
keep the original stack frame. On errors, you can "longjump" (not exactly,
but similar) to the error path after copying the memory. The semaphore will
not be taken, and the code depending on the semaphore will not be executed.



BTW: Your Reply-To: should be omitted if it's equal to the From:

2005-02-23 03:40:39

by Horst H. von Brand

Subject: Re: uninterruptible sleep lockups

Chris Friesen <[email protected]> said:
> Horst von Brand wrote:
> > Anthony DiSante <[email protected]> said:

> >>That's one of the things I asked a few messages ago. Some people on
> >>the list were saying that it'd be "really hard" and would "require a
> >>lot of bookkeeping" to "fix" permanently-D-stated processes... which is
> >>completely different than "impossible."

> > Most people here have little clue. It can't be done.

> I realize it would be extremely difficult if not impossible to do in the
> current linux architecture, but I find it hard to believe that it is
> technically impossible if one were allowed to design the system from
> scratch.

It is hard (if not impossible) to find out /what/ is broken (and how) and
fix it automatically. As you were told, D means the process is waiting for
some event. That event /might/ happen sometime (waiting for slow hardware)
or never (kernel programming error, hardware forgot the operation in
progress, ...). So you might fake it out by making believe the event did
happen. What if it was just delayed, and /does/ then happen with nobody
waiting?

Any such is just papering over the problems, and is /massive/ complexity
for no real gain.

> Maybe I'm on crack, but would it not be technically possible to have all
> resource usage be tracked so that when a task tries to do something and
> hangs, eventually it gets cleaned up?

Sure. But there is /no way/ to know if the task will ever do something
(Turing's undecidability sees to that, even with perfect hardware), so the
only chance is to wait and see if the task releases it by itself. If you
just want to axe the task, you'd have to know beforehand what it will do
(and do it for the task on killing it). But the /task/ couldn't do it, what
guarantees the cleanup can?

> We already handle cleaning up stuff for userspace (memory, file
> descriptors, sockets, etc.).

On process end, i.e., when we know the stuff won't be used anymore. If the
program is stuck, kill it and go as before. If it doesn't go away cleanly,
something is /seriously/ wrong... and it is anybody's guess what.

> Why not enforce a design that says "all
> entities taking a lock must specify a maximum hold time".

It is hard enough to program without such restrictions. This would
incidentally also mean that the kernel has to be hard real time,
always. The usual PC hardware just isn't up to that, for starters.

And what would you do if you have nested locks, and the outer one times
out? Must kill the inner one beforehand... more complexity still.

> After that
> time expires, they are assumed to be hung, and all their resources
> (which were being tracked by some system) get cleaned up.

> It would probably be complicated, slow, and generally not worth the
> effort. But it seems at least technically possible.

If the system takes all extant resources for managing said resources, it is
somewhat pointless...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-02-23 08:27:20

by Olaf Titz

Subject: Re: uninterruptible sleep lockups

In article <[email protected]> you write:
> The most recent one was yesterday: I had run lsusb in the morning and had no
> problems, but at the end of the day I ran it again, and after outputting 3
> lines of data, it hung, stuck in D-state. So now I have this:
>
> [/home/user]$ ps aux|grep D
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 92 0.0 0.0 0 0 ? D Feb19 0:00 [khubd]
> root 845 0.0 0.0 0 0 ? D Feb19 0:00 [knodemgrd_0]
> root 29016 0.0 0.1 1512 592 ? D 00:28 0:00 lsusb

I'm getting fairly repeatable deadlocks of this kind involving khubd
with a USB storage device. Perhaps there's just a faulty locking issue
in khubd.
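
As an aside, "ps aux|grep D" also matches the header line and any command
containing a capital D. A small Python sketch that filters on the STAT
column instead; d_state_processes is a hypothetical helper name, and it
assumes the usual "ps aux" layout where COMMAND is the last column and the
START field contains no spaces:

```python
def d_state_processes(ps_output):
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    ncols = len(header)
    hung = []
    for line in lines[1:]:
        # Split into at most ncols fields so COMMAND keeps its spaces.
        fields = line.split(None, ncols - 1)
        if len(fields) == ncols and fields[stat_col].startswith("D"):
            hung.append((int(fields[pid_col]), fields[-1]))
    return hung


# Sample taken from the output quoted above, plus one non-D line.
SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 92 0.0 0.0 0 0 ? D Feb19 0:00 [khubd]
root 845 0.0 0.0 0 0 ? D Feb19 0:00 [knodemgrd_0]
user 100 0.0 0.1 1512 592 pts/0 S 00:28 0:00 grep D
root 29016 0.0 0.1 1512 592 ? D 00:28 0:00 lsusb
"""

for pid, command in d_state_processes(SAMPLE):
    print(pid, command)
```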

Olaf

PS. Linux 2.6.9

2005-02-23 10:04:46

by Bernd Petrovitsch

Subject: Re: uninterruptible sleep lockups

On Tue, 2005-02-22 at 22:05 -0300, Horst von Brand wrote:
> Chris Friesen <[email protected]> said:
[...]
> > Maybe I'm on crack, but would it not be technically possible to have all
> > resource usage be tracked so that when a task tries to do something and
> > hangs, eventually it gets cleaned up?
>
> Sure. But there is /no way/ to know if the task will ever do something
> (Turing's undecidability sees to that, even with perfect hardware), so the

ACK. But if root says "it will not come" (through whatever method), then
we have a decision good enough for real life.
The downside is that every use of these items then needs explicit
checks and cleanup code (which has to be written and tested) after
each usage.
Does this pay off?
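
To make the cost concrete, a minimal Python sketch of what such a check
looks like at a single call site. This is a userspace illustration only;
Resource, DeclaredDead, wait_for_completion and read_block are hypothetical
names, and the "dead" flag stands in for whatever method root would use.

```python
import threading


class DeclaredDead(Exception):
    """The administrator declared that the awaited event will never come."""


class Resource:
    def __init__(self):
        self.done = threading.Event()   # set when the awaited event arrives
        self.dead = threading.Event()   # set by the manual override


def wait_for_completion(res, poll=0.01):
    # Wait for the event, but re-check the override between short waits.
    while not res.done.wait(poll):
        if res.dead.is_set():
            raise DeclaredDead()
    return "ok"


def read_block(res, held):
    # Every caller needs its own cleanup path for the "declared dead" case.
    held.append("buffer")
    try:
        wait_for_completion(res)
        return "data"
    except DeclaredDead:
        return None
    finally:
        held.remove("buffer")
```

Each function that can block on the resource needs an equivalent of the
except/finally pair above, which is where the writing and testing effort
goes.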

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-02-23 13:51:30

by linux-os (Dick Johnson)

Subject: Re: uninterruptible sleep lockups

On Wed, 23 Feb 2005, Bodo Eggert wrote:

> linux-os <[email protected]> wrote:
>
>> You don't seem to understand. A process that's stuck in 'D' state
>> shows a SEVERE error, usually with a hardware driver.
>
> Or a network filesystem mount to a no longer existing server or share.
>

But that's a whole different problem. That's a systemic problem
of "fail-over". Network file-systems really need to interface
with an intermediate virtual device that can isolate failed
systems and make them look "perfect" to individual machines.

If you don't do this, then as soon as somebody trips over a
wire, your database is trashed. I'm surprised that NFS, PCNFS,
SMB, etc., actually work as well as everybody seems to
think they do. Until the architectural problem is resolved,
there are still going to be hung processes, trashed databases,
etc.

>> For instance,
>> somebody may have coded something in a critical section that will
>> wait forever for some bit to be set when, in fact, that bit may
>> never be set because of a hardware glitch. Such problems must
>> be found. One can't just suck some process out of the 'D' state.
>
> But you can easily fall into one, e.g. by mounting a SMB share to ~/mnt,
> working until after the windows box breaks down and trying to save the
> work of the last hour (which involves enumerating and stat()ing all
> entries in ~).
>

Yes. See above.

>> The 'D' state usually stands for 'Down' where a task
>> was 'down()' on a semaphore. To get out of that state,
>> that task (and none other) needs to execute `up()`.
>> This means that whatever that task was waiting for
>> needs to happen or it won't call 'up()'.
>
> Maybe the device/mountpoint causing the processes to hang can be declared
> dead (This is the more important part to me) and/or the syscall can be
> forced to fail. If it involves wasting some MB of RAM for copying all
> possibly affected memory in order to avoid corrupting used RAM, that
> will be the price to pay for not losing your data.
>

That's not how it's done.

> How to clean up the stuck processes: (This requires a MMU)
> Add an error path to each syscall (or create some generic error paths) and
> keep the original stack frame. On errors, you can "longjump" (not exactly,
> but similar) to the error path after copying the memory. The semaphore will
> not be taken, and the code depending on the semaphore will not be executed.
>

Again, you are attacking the symptom. The problem could be resolved
by using a local disk (or a disk file) for the immediate I/O and
the I/O to the file-servers could occur whenever they are available.
It's just ordinary transaction processing. Nothing new. It's just
that people continue to use primitive garbage (really, usually
developed by amateur hackers with no formal education) that is
then specified by the likes of Microsoft and then, to be compatible,
other operating systems create clones with the same kinds of
unfixable bugs.

>
> BTW: Your Reply-To: should be omitted if it's equal to the From:
>

The problem with From: is this machine is not "known" to
the outside world, although somebody has entries in
the auth02.ns.uu.net name-server that claims to be my
machine, which gets cached and cloned everywhere. Mail
to this system needs to go to the Reply-To: address.

Our network "experts" here have tried to track down the
bad name-server entry and they say it's not here.

All of my machine names mysteriously appear in
auth02.ns.uu.net with 204.178.40.nnn IP addresses.
This really screws up email because email tries
to verify the sender by contacting those bogus
addresses.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-02-23 16:40:59

by Nish Aravamudan

Subject: Re: uninterruptible sleep lockups

On Tue, 22 Feb 2005 22:31:03 +0100, Olaf Titz <[email protected]> wrote:
> In article <[email protected]> you write:
> > The most recent one was yesterday: I had run lsusb in the morning and had no
> > problems, but at the end of the day I ran it again, and after outputting 3
> > lines of data, it hung, stuck in D-state. So now I have this:
> >
> > [/home/user]$ ps aux|grep D
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > root 92 0.0 0.0 0 0 ? D Feb19 0:00 [khubd]
> > root 845 0.0 0.0 0 0 ? D Feb19 0:00 [knodemgrd_0]
> > root 29016 0.0 0.1 1512 592 ? D 00:28 0:00 lsusb
>
> I'm getting fairly repeatable deadlocks of this kind involving khubd
> with a USB storage device. Perhaps there's just a faulty locking issue
> in khubd.

Would you be willing to file a bugzilla (bugzilla.kernel.org) bug, if
it's still happening with 2.6.11-rc4? Or, if you have filed one,
please refer to it?

Thanks,
Nish

2005-02-23 16:55:13

by Parag Warudkar

Subject: Re: uninterruptible sleep lockups

I have recently run into a similar issue involving processes stuck in D
state; it involves khubd and usb-storage. This happens with 2.6.11-rc4.

Check lkml for subject Re: [linux-usb-devel] 2.6: USB Storage hangs mac..

Parag


> On Tue, 22 Feb 2005 22:31:03 +0100, Olaf Titz <[email protected]> wrote:
> > In article <[email protected]> you write:
> > > The most recent one was yesterday: I had run lsusb in the morning and had no
> > > problems, but at the end of the day I ran it again, and after outputting 3
> > > lines of data, it hung, stuck in D-state. So now I have this:
> > >
> > > [/home/user]$ ps aux|grep D
> > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > > root 92 0.0 0.0 0 0 ? D Feb19 0:00 [khubd]
> > > root 845 0.0 0.0 0 0 ? D Feb19 0:00 [knodemgrd_0]
> > > root 29016 0.0 0.1 1512 592 ? D 00:28 0:00 lsusb
> >
> > I'm getting fairly repeatable deadlocks of this kind involving khubd
> > with a USB storage device. Perhaps there's just a faulty locking issue
> > in khubd.
>
> Would you be willing to file a bugzilla (bugzilla.kernel.org) bug, if
> it's still happening with 2.6.11-rc4? Or, if you have filed one,
> please refer to it?
>
> Thanks,
> Nish

2005-02-24 02:05:13

by Bodo Eggert

Subject: Re: uninterruptible sleep lockups

On Wed, 23 Feb 2005, linux-os wrote:
> On Wed, 23 Feb 2005, Bodo Eggert wrote:
> > linux-os <[email protected]> wrote:

> >> You don't seem to understand. A process that's stuck in 'D' state
> >> shows a SEVERE error, usually with a hardware driver.
> >
> > Or a network filesystem mount to a no longer existing server or share.
>
> But that's a whole different problem. That's a systemic problem
> of "fail-over". Network file-systems really need to interface
> with an intermediate virtual device that can isolate failed
> systems and make them look "perfect" to individual machines.
>
> If you don't do this, then as soon as somebody trips over a
> wire, your database is trashed. I'm surprised that NFS, PCNFS,
> SMB, etc., actually work as well as everybody seems to
> think they do. Until the architectural problem is resolved,
> there are still going to be hung processes, trashed databases,
> etc.

You don't run databases over a network filesystem unless you're begging
for trouble. For the other common purposes you'll usually get a more
stable behaviour, since the failure on the client won't prevent the server
from properly writing the metadata or flushing the cache.

> > How to clean up the stuck processes: (This requires a MMU)
> > Add an error path to each syscall (or create some generic error paths) and
> > keep the original stack frame. On errors, you can "longjump" (not exactly,
> > but similar) to the error path after copying the memory. The semaphore will
> > not be taken, and the code depending on the semaphore will not be executed.
> >
>
> Again, you are attacking the symptom. The problem could be resolved
> by using a local disk (or a disk file) for the immediate I/O and
> the I/O to the file-servers could occur whenever they are available.

a) There are systems without local storage.

b) It won't help while stat()ing a non-cached object.

c) This would involve race conditions for e.g. two disconnected nodes on
reconnect. AFAI can see, this race can be solved by:
c1) The final transaction must be delayed until it's ACKed or
NACKed. This may delay the D-State for some seconds, but not enough.
c2) The server will have to keep track of the clients and need to be told
when a user left for a trip to the south pole without unmounting.
Very undesirable.
c3) Ignoring. Very, very undesirable.
c4) Requiring explicit transaction handling by the applications.
Interesting, but not in the near future.

d) This won't allow synchronous updates without falling back to classic
handling.

e) The users will update some files, get a positive reply and shut down
their PCs before the changes can be committed to the server. If the
server will not come back or the client is not rebooted within
reasonable time, this will cause silent data loss.

f) This will require reliable identification of the network server.

g) I'm not only thinking of NFS/..., although I used it as _the_ example.
E.g. if you see your IDE drive failing, you'll want to declare it dead
instead of waiting $num_of_sectors times five minutes until the kernel
decides to give up.
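
Point e) can be shown with a small Python sketch of the proposed
local-disk scheme. WriteBehindJournal is a hypothetical name, and the send
callable is an assumed stand-in for the real transfer to the file server:

```python
class WriteBehindJournal:
    """Toy model: accept writes locally, forward them to the server later."""

    def __init__(self, send):
        self._send = send      # callable(path, data) -> True if server ACKed
        self.pending = []      # writes the server has not seen yet

    def write(self, path, data):
        self.pending.append((path, data))
        # Positive reply, although nothing has reached the server yet.
        return True

    def flush(self):
        # Forward in order; stop at the first write the server does not ACK.
        while self.pending:
            path, data = self.pending[0]
            if not self._send(path, data):
                break
            self.pending.pop(0)
        return len(self.pending)
```

If the client is switched off while flush() still returns a non-zero
count, the user has already seen write() succeed and the data is silently
gone.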


I agree that most D-states are problems that need to be fixed instead of
being worked-around, but sometimes you can't fix the problem without
access to the crystal-ball-device. Therefore all devices that can block
will need a manual override (with different probability), and the
processes that were stuck will need a way to recover or be stuck forever.

Obviously the system is healthy enough to do some important and
uninterruptible work after those errors occurred, so having them stuck will
be OK for now. Instead, the next task might be freeing the file
descriptors preventing you from unmounting your removable media or network
share or allowing really-forced umount.

--
Top 100 things you don't want the sysadmin to say:
54. Uh huh......"nu -k $USER".. no problem....sure thing...