LinuxLists.cc - RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED

2004-10-29 14:38:06

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

I have reproduced this hang on 2.6.10-rc1-bk7, and have also installed the sysrq-n patch. Even after "SysRq : Nice All RT Tasks",
the system is completely unresponsive as far as user mode is concerned, and will only react to SysRq. It -does- respond to ICMP
pings. Sysrq-e, -k, -i do not stop the offending tt1 process.

I do not have netdump available in 2.6.10-rc1-bk7, and so cannot provide a full sysrq-t output, but the visible section shows two
tt1 threads with identical stacks:

schedule_timeout+0xd0/0xd2
futex_wait+0x140/0x1a9
do_futex+0x33/0x78
sys_futex+0xcd/0xd9
sysenter_past_esp+0x52/0x71

I then tried running this task as a non-root user, which should prevent SCHED_RR and PRIO changes of the threads/tasks. Under these
conditions, the system does *not* hang. I noticed that the app periodically ends up in a high-speed loop involving the
ACE_Semaphore class in ACE; having checked the compilation flags, it seems ACE is simulating semaphores using the calls below. It is
*not* using POSIX 1003.1b semaphores (sem_wait, etc.).

pthread_mutex_lock()
pthread_cond_wait()
pthread_cond_signal()
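
For readers unfamiliar with the pattern, a counting semaphore can be simulated with exactly those three calls. The following is only a minimal sketch of the general technique, not ACE's actual implementation; the names sim_sem, sim_sem_init, sim_sem_acquire and sim_sem_release are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical counting semaphore built from a mutex and a condition
 * variable. Illustrative only; this is not taken from ACE. */
struct sim_sem {
    pthread_mutex_t lock;     /* protects count */
    pthread_cond_t  nonzero;  /* signalled when count becomes > 0 */
    unsigned int    count;    /* number of available units */
};

/* Returns 0 on success, or a pthread error number on failure. */
int sim_sem_init(struct sim_sem *s, unsigned int initial)
{
    int rc = pthread_mutex_init(&s->lock, NULL);
    if (rc != 0)
        return rc;
    rc = pthread_cond_init(&s->nonzero, NULL);
    if (rc != 0) {
        pthread_mutex_destroy(&s->lock);
        return rc;
    }
    s->count = initial;
    return 0;
}

/* Block until a unit is available, then take it. */
void sim_sem_acquire(struct sim_sem *s)
{
    pthread_mutex_lock(&s->lock);
    /* Re-check in a loop: pthread_cond_wait() may wake spuriously. */
    while (s->count == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    assert(s->count > 0);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

/* Return a unit and wake one waiter, if any. */
void sim_sem_release(struct sim_sem *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}
```

Every acquire and release in such a scheme takes the mutex, so threads that acquire and release in a tight loop spend most of their time in the mutex path, which on this kernel is backed by futexes.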

Although it appears I need to fix an application bug, is it normal/desirable for an application calling system mutex facilities to
starve the system so completely, and/or become "unkillable"?

A.

-----Original Message-----
From: Andrew [mailto:[email protected]]
Sent: Thursday, October 28, 2004 5:10 PM
To: [email protected]
Cc: [email protected]; Andrew Morton
Subject: Consistent lock up 2.6.8-1.521 (and 2.6.8.1 w/
high-res-timers/skas/sysemu)

Caveat: This may be an infinite loop in a SCHED_RR process. See very bottom of email for sysrq-t sysrq-p output.

[LARGE EMAIL DELETED]

2004-10-29 15:47:17

by Andrew A.

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

For whatever reason, I have been unable to send the original message below through vger.

I am therefore enclosing some of the important text here:

==========

I have in the past posted emails with the subject "Consistent kernel hang during heavy TCP connection handling load". I then
recently saw a linux-kernel thread "PROBLEM: Consistent lock up on >=2.6.8" that seemed to be related to the problem I am
experiencing, but I am not sure of it (thus the cc:).

I have, however, now managed to formulate a series of steps which reproduce my particular lock up consistently. When I trigger the
hang, all virtual consoles become unresponsive, and the application cannot be signaled from the keyboard. Sysrq seems to work.

The application in question is called "tt1". It runs several threads in SCHED_RR and uses select(), sleep() and/or nanosleep()
extensively. I suspect there's a good chance the application calls select() with nfds=0 at some point.

Because tt1 uses SCHED_RR, before triggering the tt1 hang I tried logging into a virtual console on the host and running
"nice -20 bash" as root. The nice'd shell hangs just like everything else.

Did I do it right? I was trying to make sure this hang is not simply an infinite loop in a SCHED_RR high priority process (tt1).

I initially had a lot of trouble trying to capture sysrq output, but then I checked my netlog host and found (lo and behold) that it
had captured it! Of course, that was before I went through the trouble of taking pictures of my monitor! I've included the netlog
sysrq output from two runs below. They are at the very bottom of this email, separated by lines of '*'s. These runs are probably
DIFFERENT from the runs from which I produced the screenshots below.

So, here are those screenshots. I still welcome any comments you might have about easier ways to capture sysrq output than using
netdump!

I modified /etc/syslog.conf to say kern.* /var/log/kernel; however, output of sysrq-t and sysrq-p while in the locked up state never
ends up in the file (though it does when not locked up).

The sysrq output and screenshots can be found at triple w dot memeplex dot com slash

sysrq[1-2].txt.gz
lock[1-3].gif mapping to System.map-2.6.8.1.gz
lock[4-5].gif mapping to System.map-2.6.8-1.521.gz

=======

2004-10-29 16:14:11

by Andrew A.

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

For whatever reason, I have been unable to send the original message below through vger.

I am therefore enclosing some of the important text here:

==========

I have in the past posted emails with the subject "Consistent kernel hang during heavy TCP connection handling load". I then
recently saw a linux-kernel thread "PROBLEM: Consistent lock up on >=2.6.8" that seemed to be related to the problem I am
experiencing, but I am not sure of it (thus the cc:).

I have, however, now managed to formulate a series of steps which reproduce my particular lock up consistently. When I trigger the
hang, all virtual consoles become unresponsive, and the application cannot be signaled from the keyboard. Sysrq seems to work.

The application in question is called "tt1". It runs several threads in SCHED_RR and uses select(), sleep() and/or nanosleep()
extensively. I suspect there's a good chance the application calls select() with nfds=0 at some point.

Because tt1 uses SCHED_RR, before triggering the tt1 hang I tried logging into a virtual console on the host and running
"nice -20 bash" as root. The nice'd shell hangs just like everything else.

Did I do it right? I was trying to make sure this hang is not simply an infinite loop in a SCHED_RR high priority process (tt1).

I initially had a lot of trouble trying to capture sysrq output, but then I checked my netlog host and found (lo and behold) that it
had captured it! Of course, that was before I went through the trouble of taking pictures of my monitor! I've included the netlog
sysrq output from two runs below. They are at the very bottom of this email. These runs are probably DIFFERENT from the runs from
which I produced the screenshots below.

So, here are those screenshots. I still welcome any comments you might have about easier ways to capture sysrq output than using
netdump!

I modified /etc/syslog.conf to say kern.* /var/log/kernel; however, output of sysrq-t and sysrq-p while in the locked up state never
ends up in the file (though it does when not locked up).

The sysrq output and screenshots can be found at triple w dot memeplex dot com slash

sysrq[1-2].txt.gz
lock[1-3].gif mapping to System.map-2.6.8.1.gz
lock[4-5].gif mapping to System.map-2.6.8-1.521.gz

=======

2004-10-29 16:22:32

by Alan

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

On Fri, 2004-10-29 at 15:26, Andrew wrote:
> Although it appears I need to fix an application bug, is it normal/desirable for an application calling system mutex facilities to
> starve the system so completely, and/or become "unkillable"?

If it is SCHED_RR then it may get to hog the processor but it should not
be doing worse than that and should be killable by something higher
priority.

You are right to suspect futexes don't deal with hard real time but the
failure you see isn't the intended failure case.

[Inaky has posted some drafts of a near futex efficient lock system that
ought to work for real time use btw]

Alan

2004-10-29 16:52:28

by Andrew A.

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

Alan:

Thanks for your note. The application in question is not "hard RT" and I am using SCHED_RR to improve latency, rather than
guarantee a particular latency number. Also, since I am using the ACE framework, and don't have the time to untangle its
portability preprocessor macros to add support for a different futex/mutex mechanism, I'm inclined to use stock code. I did dig up
Inaky's work, which is a fusyn mapping to existing futex calls; I might try that.

However, would any of that really solve this problem? That is, do lower priority non-RR tasks and/or kernel signal delivery benefit
from additional scheduled time under those patches?

I suspect what is happening here is that my process is essentially in a

while(1)
{
lock();
unlock();
}

loop from two or more SCHED_RR threads running at nice -15. They seem to be unkillable.

However, should we really dismiss the possibility that the problem could be that these threads are in some kind of deadlock that
involves the scheduler?

A.

2004-10-29 17:12:13

by Chris Wright

Subject: Re: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

* Andrew A. ([email protected]) wrote:
> I suspect what is happening here is that my process is essentially in a
>
> while(1)
> {
> lock();
> unlock();
> }
>
> loop from two or more SCHED_RR threads running at nice -15. They seem to be unkillable.

Give yourself a shell that's SCHED_RR with a higher priority. I've used
the small hack below to debug userspace SCHED_RR problems (newer distros
have a chrt utility to do this).

thanks,
-chris
--

#include <stdio.h>
#include <stdlib.h>     /* atoi(), exit() */
#include <sys/types.h>
#include <unistd.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

/* usage: prog [pid [priority [policy]]]; pid 0 means this process */
int main(int argc, char *argv[])
{
    pid_t pid = 0;
    int priority = 99;
    int policy = SCHED_RR;
    struct sched_param sched;

    if (argc > 1) {
        pid = atoi(argv[1]);
        if (argc > 2) {
            priority = atoi(argv[2]);
            if (argc > 3)
                policy = atoi(argv[3]);
        }
    }

    memset(&sched, 0, sizeof(sched));
    sched.sched_priority = priority;
    if (sched_setscheduler(pid, policy, &sched) < 0) {
        printf("setscheduler: %s\n", strerror(errno));
        exit(1);
    }

    if (!pid) { /* turn this into a shell */
        argv[0] = "/bin/bash";
        argv[1] = NULL;
        execv(argv[0], argv);
        /* execv() only returns on failure */
        printf("execv: %s\n", strerror(errno));
        exit(1);
    }

    return 0;
}

2004-10-29 17:47:52

by Andrew A.

Subject: RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

chrt 25 bash

Shell remains as badly hung as everything else. The code sets the SCHED_RR priority of the task and threads in tt1 to 10. I'm left
thinking: Shouldn't the system be scheduling the shell? Is this a problem with priority inversion due to 2+ threads doing the
lock()/unlock() dance and never giving the bash a chance to run? Is the system able to schedule signal and/or select wakeups (for
bash) in this condition?

Thanks, I wasn't aware of the chrt command and had only been using nice on my shell. The man pages on all this stuff are rather
confusing: Which priority numbers are valid, how priorities interact, negative vs. positive priorities, process vs. thread
priority, what is a "dynamic" vs. "static" priority, etc.

My impression after re-re-reading the man pages was that it would be sufficient to have a non-SCHED_RR shell with a high enough
nice value.

2004-10-29 20:43:48

by Chris Wright

Subject: Re: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)

* Andrew A. ([email protected]) wrote:
>
> chrt 25 bash

Try 99.

> Shell remains as badly hung as everything else. The code sets the SCHED_RR priority of the task and threads in tt1 to 10. I'm left
> thinking: Shouldn't the system be scheduling the shell? Is this a problem with priority inversion due to 2+ threads doing the
> lock()/unlock() dance and never giving the bash a chance to run? Is the system able to schedule signal and/or select wakeups (for
> bash) in this condition?

Not knowing what tt1 is doing it's hard to say. Ah, I missed the
priority you used, so 99 above shouldn't be needed.

> Thanks, I wasn't aware of the chrt command and had only been using nice on my shell. The man pages on all this stuff are rather
> confusing: Which priority numbers are valid, how priorities interact, negative vs. positive priorities, process vs. thread
> priority, what is a "dynamic" vs. "static" priority, etc.

Dynamic is adjusted by the behaviour (using up timeslice, blocking,
waiting to run) or by nice. Static is the base value used when figuring
out what the dynamic should be (can be changed via nice or setpriority).
IIRC, realtime priorities effectively stay static (unless changed
via sched_setscheduler). The dynamic priority is what's used in
scheduling decisions. The userspace interfaces are a bit confusing.
The kernel keeps track of it a bit more simply. Internally, the
priority ranges between 0 and 139 (0 is highest priority). 0-99 are for
realtime tasks, and 100-139 are for normal tasks (note how the top 40
priorities can map to nice values -- where -20 == 100, and 19 == 139).
The nice(2) (and setpriority(2)) interface lets you adjust the static
priority in that upper range (and the dynamic changes accordingly).
The sched_setscheduler(2) ranges for realtime [1, 99] map exactly inverted
to the kernel's priority (so while the syscall has 99 as highest priority,
that becomes 0 internally).
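
The mapping described above can be written down as two small functions. This is only a sketch of the arithmetic as stated in this message; the names RT_PRIO_LEVELS, nice_to_kernel_prio and rt_to_kernel_prio are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>

/* Illustrative constant: internal priorities 0-99 are for realtime
 * tasks, 100-139 for normal tasks (lower number == higher priority). */
#define RT_PRIO_LEVELS 100

/* nice value in [-20, 19] -> internal priority in [100, 139] */
int nice_to_kernel_prio(int nice_value)
{
    assert(nice_value >= -20 && nice_value <= 19);
    return RT_PRIO_LEVELS + nice_value + 20;
}

/* sched_setscheduler(2) priority in [1, 99] -> internal priority,
 * inverted so that 99 (highest for the syscall) becomes 0. */
int rt_to_kernel_prio(int rt_priority)
{
    assert(rt_priority >= 1 && rt_priority <= 99);
    return (RT_PRIO_LEVELS - 1) - rt_priority;
}
```

Under this arithmetic, the tt1 threads at SCHED_RR priority 10 land at internal priority 89 and a "chrt 25" shell at 74, so the shell should be preferred, while any nice value stays at 100 or above and can never outrank a realtime task.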

> My impression after re-re-reading the man pages was that it would be sufficient to have a non-SCHED_RR shell with a high enough
> nice value.

High enough priority set via sched_setscheduler(2), not nice value.
nice [1, 19] actually lowers your priority, while [-20, -1] increases it.
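
On the question of which priority numbers are valid for sched_setscheduler(2), the range can be queried at run time instead of being read off the man page. A minimal sketch, using the standard sched_get_priority_min(2) and sched_get_priority_max(2) calls; the helper name rr_priority_is_valid is hypothetical.

```c
#include <sched.h>

/* Hypothetical helper: returns 1 if 'priority' is within the range the
 * running system reports for SCHED_RR, 0 otherwise (or if the range
 * cannot be queried). */
int rr_priority_is_valid(int priority)
{
    int lo = sched_get_priority_min(SCHED_RR);
    int hi = sched_get_priority_max(SCHED_RR);

    if (lo < 0 || hi < 0)   /* both calls return -1 on error */
        return 0;
    return priority >= lo && priority <= hi;
}
```

A check like this could be run before calling sched_setscheduler(2), so that an out-of-range value is caught with a clear message rather than an EINVAL from the syscall.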

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net