2004-11-15 11:59:10

by Ulrich Windl

Subject: CPU hogs ignoring SIGTERM (unkillable processes)

Hello,

today I've discovered a programming error in one of my programs (that's fixed
already). When trying to replace the binary, I found out that the processes seem
unaffected by a plain "kill": They just continue to consume CPU. However, a "kill
-9" terminates them. Is that intended behavior? I guess not. Here are some facts:

top - 12:51:33 up 145 days, 22:20, 1 user, load average: 8.06, 8.77, 9.51
Tasks: 85 total, 9 running, 76 sleeping, 0 stopped, 0 zombie
Cpu(s): 57.4% user, 1.9% system, 0.0% nice, 40.6% idle
Mem: 191192k total, 90716k used, 100476k free, 19004k buffers
Swap: 132088k total, 29140k used, 102948k free, 37580k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ Command
28042 daemon 25 0 92 32 28 R 13.2 0.0 163:59.44 extractMIME
29435 daemon 25 0 92 32 28 R 13.2 0.0 156:34.27 extractMIME
31211 daemon 25 0 92 32 16 R 13.2 0.0 145:12.22 extractMIME
156 daemon 25 0 92 32 16 R 13.2 0.0 135:24.49 extractMIME
4079 daemon 25 0 92 32 16 R 13.2 0.0 109:48.36 extractMIME

This is about 3 minutes after executing this command:
kill 27176 27457 28042 29435 31211 156 4079

This happened for both SUSE kernels, old and new:
m1 2.4.20-4GB #1 Mon Mar 17 17:54:44 UTC 2003
m2 2.6.5-7.111-default #1 Wed Oct 13 15:45:13 UTC 2004

And no, the C program does not install any signal handler. If interested I can
provide the binary together with sample parameters to execute the loop.

Regards,
Ulrich


2004-11-15 14:09:34

by Andreas Schwab

Subject: Re: CPU hogs ignoring SIGTERM (unkillable processes)

"Ulrich Windl" <[email protected]> writes:

> Hello,
>
> today I've discovered a programming error in one of my programs (that's fixed
> already). When trying to replace the binary, I found out that the processes seem
> unaffected by a plain "kill": They just continue to consume CPU. However, a "kill
> -9" terminates them. Is that intended behavior? I guess not. Here are some facts:

Are you sure it doesn't block or ignore the signal?

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2004-11-16 07:55:27

by Ulrich Windl

Subject: Re: CPU hogs ignoring SIGTERM (unkillable processes)

On 15 Nov 2004 at 14:39, Andreas Schwab wrote:

> "Ulrich Windl" <[email protected]> writes:
>
> > Hello,
> >
> > today I've discovered a programming error in one of my programs (that's fixed
> > already). When trying to replace the binary, I found out that the processes seem
> > unaffected by a plain "kill": They just continue to consume CPU. However, a "kill
> > -9" terminates them. Is that intended behavior? I guess not. Here are some facts:
>
> Are you sure it doesn't block or ignore the signal?

Andreas,

I don't mess with signals (as said); the code just parses the same area of memory
again and again (due to a programming error). As offered, you can get the binary
and the sample command line to reproduce (or at least try to reproduce) the situation if you
like. If I hadn't experienced it, I wouldn't believe myself ;-)

Regards,
Ulrich

2004-11-16 10:54:46

by Andreas Schwab

Subject: Re: CPU hogs ignoring SIGTERM (unkillable processes)

"Ulrich Windl" <[email protected]> writes:

> On 15 Nov 2004 at 14:39, Andreas Schwab wrote:
>
>> "Ulrich Windl" <[email protected]> writes:
>>
>> > Hello,
>> >
>> > today I've discovered a programming error in one of my programs (that's fixed
>> > already). When trying to replace the binary, I found out that the processes seem
>> > unaffected by a plain "kill": They just continue to consume CPU. However, a "kill
>> > -9" terminates them. Is that intended behavior? I guess not. Here are some facts:
>>
>> Are you sure it doesn't block or ignore the signal?
>
> Andreas,
>
> I don't mess with signals (as said);

That is not required. It could just as well inherit the setting from the
parent.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2004-11-16 13:22:00

by Ulrich Windl

Subject: Re: CPU hogs ignoring SIGTERM (unkillable processes)

On 16 Nov 2004 at 11:42, Andreas Schwab wrote:

> "Ulrich Windl" <[email protected]> writes:
>
> > On 15 Nov 2004 at 14:39, Andreas Schwab wrote:
> >
> >> "Ulrich Windl" <[email protected]> writes:
> >>
> >> > Hello,
> >> >
> >> > today I've discovered a programming error in one of my programs (that's fixed
> >> > already). When trying to replace the binary, I found out that the processes seem
> >> > unaffected by a plain "kill": They just continue to consume CPU. However, a "kill
> >> > -9" terminates them. Is that intended behavior? I guess not. Here are some facts:
> >>
> >> Are you sure it doesn't block or ignore the signal?
> >
> > Andreas,
> >
> > I don't mess with signals (as said);
>
> That is not required. It could just as well inherit the setting from the
> parent.

Oops! Now that you are telling me, I realize that the program that hung was
actually started by a shell script that was in turn exec()'d (after a fork()) from
a truly multi-threaded application.

I couldn't capture the binary via "ps -axs", because it's terribly efficient, but
I managed to capture the shell script that way:

UID   PID PENDING          BLOCKED          IGNORED          CAUGHT           STAT TTY TIME COMMAND
  2  7792 0000000000000000 0000000080014003 8000000000001004 0000000000010002 S    ?   0:00 /bin/sh /usr/local/milter/Sopho

The manpage on execve() isn't too verbose on the topic:

...
execve() does not return on success, and the text, data,
bss, and stack of the calling process are overwritten by
that of the program loaded. The program invoked inherits
the calling process's PID, and any open file descriptors
that are not set to close on exec. Signals pending on the
calling process are cleared. Any signals set to be caught
by the calling process are reset to their default
behaviour. The SIGCHLD signal (when set to SIG_IGN) may
or may not be reset to SIG_DFL.
...

"Yet Another UNIX-like OS" dared to state in the man page:

The processing of signals by the process is unchanged by exec*(),
except that signals caught by the process are set to their default
values (see signal(2)).

However the same man page states that pending signals are not cleared.
Interesting.

The process also retains the following attributes:

...
+ pending signals
...

This is the reason for returning the following errno, I guess:

[EINTR] A signal was caught during the exec*() system
call.


Let me summarize: The kernel has no problem, just the docs ;-) Programmers are fed
with docs, of course...

Thanks and regards,
Ulrich Windl