One such stuck process had the following backtrace:
#0 schedule_timeout (timeout=-150765944) at timer.c:864
#1 0xc01a28a3 in uart_wait_until_sent (tty=0xf7669000, timeout=2147483647)
at core.c:1320
#2 0xc01afca8 in tty_wait_until_sent (tty=0xf7669000, timeout=0)
at tty_ioctl.c:66
#3 0xc01b0049 in set_termios (tty=0xf7669000, arg=3221221720, opt=2)
at tty_ioctl.c:164
#4 0xc01b03dc in n_tty_ioctl (tty=0xf7669000, file=0xf716b8a0, cmd=21507,
arg=3221221720) at tty_ioctl.c:409
#5 0xc01ac3b0 in tty_ioctl (inode=0xf72b9cf4, file=0xf716b8a0, cmd=21507,
arg=3221221720) at tty_io.c:1798
#6 0xc0152cf6 in sys_ioctl (fd=0, cmd=21507, arg=3221221720) at ioctl.c:128
#7 0xc01078df in syscall_call () at process.c:982
This doesn't appear to be serial-specific. Another stuck process is:
(gdb) bt
#0 schedule_timeout (timeout=2147483647) at timer.c:836
#1 0xc01af23d in read_chan (tty=0xf73da000, file=0xf781f6a0,
buf=0xbffffd93 "", nr=1) at n_tty.c:1043
#2 0xc01aa4b6 in tty_read (file=0xf781f6a0, buf=0xbffffd93 "", count=1,
ppos=0xf781f6c0) at tty_io.c:677
#3 0xc0142e2c in vfs_read (file=0xf781f6a0, buf=0xbffffd93 "", count=1,
pos=0xf781f6c0) at read_write.c:193
#4 0xc0142ffe in sys_read (fd=0, buf=0xbffffd93 "", count=1)
at read_write.c:232
#5 0xc01078df in syscall_call () at process.c:982
It's actually possible to kick these by sending them signal-generating
characters, though the forward progress one can make this way is limited.
(1) type "ls &"
(2) This will not echo.
(3) type ^Z (^C doesn't work for some reason)
(4) "ls &" echoes
(5) no prompt appears
(6) type ^Z again
(7) the prompt doesn't appear
(8) type ^Z again
(9) the prompt appears
... this is a little oversimplified. Some pounding on the return keys is
usually also required. serial and non-serial behave identically here.
Cheers,
Bill
William Lee Irwin III wrote:
>
> One such stuck process had the following backtrace:
>
> #0 schedule_timeout (timeout=-150765944) at timer.c:864
schedule_timeout is FASTCALL, which confuses gdb's stack crawler
somewhat.
> #1 0xc01a28a3 in uart_wait_until_sent (tty=0xf7669000, timeout=2147483647)
> at core.c:1320
Might be a bit racy?
--- 2.5.32/drivers/serial/core.c~serial-race Wed Aug 28 15:22:22 2002
+++ 2.5.32-akpm/drivers/serial/core.c Wed Aug 28 15:22:26 2002
@@ -1315,13 +1315,14 @@ static void uart_wait_until_sent(struct
* 'timeout' / 'expire' give us the maximum amount of time
* we wait.
*/
+ set_current_state(TASK_INTERRUPTIBLE);
while (!port->ops->tx_empty(port)) {
- set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(char_time);
if (signal_pending(current))
break;
if (time_after(jiffies, expire))
break;
+ set_current_state(TASK_INTERRUPTIBLE);
}
set_current_state(TASK_RUNNING); /* might not be needed */
}
.
On Wed, Aug 28, 2002 at 03:25:20PM -0700, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > One such stuck process had the following backtrace:
> >
> > #0 schedule_timeout (timeout=-150765944) at timer.c:864
>
> schedule_timeout is FASTCALL, which confuses gdb's stack crawler
> somewhat.
>
> > #1 0xc01a28a3 in uart_wait_until_sent (tty=0xf7669000, timeout=2147483647)
> > at core.c:1320
>
> Might be a bit racy?
>
> --- 2.5.32/drivers/serial/core.c~serial-race Wed Aug 28 15:22:22 2002
> +++ 2.5.32-akpm/drivers/serial/core.c Wed Aug 28 15:22:26 2002
> @@ -1315,13 +1315,14 @@ static void uart_wait_until_sent(struct
> * 'timeout' / 'expire' give us the maximum amount of time
> * we wait.
> */
> + set_current_state(TASK_INTERRUPTIBLE);
> while (!port->ops->tx_empty(port)) {
> - set_current_state(TASK_INTERRUPTIBLE);
> schedule_timeout(char_time);
> if (signal_pending(current))
> break;
> if (time_after(jiffies, expire))
> break;
> + set_current_state(TASK_INTERRUPTIBLE);
> }
> set_current_state(TASK_RUNNING); /* might not be needed */
> }
Patch looks good, as far as correctness goes. However, since char_time
will be the amount of time for one character, we should never sleep long
enough for the user to notice this slip-up.
If people are seeing deadlocks, I agree with wli that there's something
very wrong elsewhere.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
Russell King wrote:
>
> ...
> > --- 2.5.32/drivers/serial/core.c~serial-race Wed Aug 28 15:22:22 2002
> > +++ 2.5.32-akpm/drivers/serial/core.c Wed Aug 28 15:22:26 2002
> > @@ -1315,13 +1315,14 @@ static void uart_wait_until_sent(struct
> > * 'timeout' / 'expire' give us the maximum amount of time
> > * we wait.
> > */
> > + set_current_state(TASK_INTERRUPTIBLE);
> > while (!port->ops->tx_empty(port)) {
> > - set_current_state(TASK_INTERRUPTIBLE);
> > schedule_timeout(char_time);
> > if (signal_pending(current))
> > break;
> > if (time_after(jiffies, expire))
> > break;
> > + set_current_state(TASK_INTERRUPTIBLE);
> > }
> > set_current_state(TASK_RUNNING); /* might not be needed */
> > }
>
> Patch looks good, as far as correctness goes. However, since char_time
> will be the amount of time for one character, we should never sleep long
> enough for the user to notice this slip-up.
>
> If people are seeing deadlocks, I agree with wli that there's something
> very wrong elsewhere.
Well Bill's trace is claiming that we're doing a schedule_timeout(0x7fffffff)
for some reason.
But yes, he seems to be able to hit it too frequently for this to be
the cause.
On Wed, Aug 28, 2002 at 05:05:44PM -0700, Andrew Morton wrote:
> Well Bill's trace is claiming that we're doing a schedule_timeout(0x7fffffff)
> for some reason.
Umm, I didn't see that:
wli> #0 schedule_timeout (timeout=-150765944) at timer.c:864
wli> #1 0xc01a28a3 in uart_wait_until_sent (tty=0xf7669000, timeout=2147483647)
wli> at core.c:1320
Its legal for uart_wait_until_sent to be called with 2147483647 (it means
"wait until all characters are sent no matter what"). However, we'll
still call schedule_timeout with a really small value (one character
time) which means we'll be waking up pretty regularly there.
> But yes, he seems to be able to hit it too frequently for this to be
> the cause.
wli - please let me know if Andrew's patch makes any difference for you.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Wed, Aug 28, 2002 at 05:05:44PM -0700, Andrew Morton wrote:
>> But yes, he seems to be able to hit it too frequently for this to be
>> the cause.
On Thu, Aug 29, 2002 at 01:20:48AM +0100, Russell King wrote:
> wli - please let me know if Andrew's patch makes any difference for you.
It's not as easy as doing a single run. It occurs "often" but not
predictably. Some runs will succeed and others will trigger it, so I
can't reliably tell whether it's been prevented. At the very least
several attempts need to be made. I'd just apply it since it's
apparently correct from the audit and I'll try to catch you again when
it shows up again (like earlier today). Unfortunately that test run was
on a freshly improvised tree not including this fix. I've included it
there now and will just have to remember it for the new tree coming up
(2.5.32-mm4 of course). I had hoped the backtraces would be more helpful,
and that someone might know what they waited for that had never happened.
I only get this once out of 10 or 20 boots, so turnaround is slow. =(
I did a bunch of boots with it +2.5.32-mm1 but don't remember the results.
Cheers,
Bill