2005-05-04 17:39:04

by Olivier Croquette

Subject: Scheduler: SIGSTOP on multi threaded processes

Hello

On a 2.6.11 x86 system, I am SIGSTOP'ing processes which have started
several threads before.

As expected, all threads are suspended.

But surprisingly, it can happen that some threads are still scheduled
after the SIGSTOP has been issued.

Typically, they get scheduled 2 times within the next 5ms, before being
really stopped.

Sadly, I could not reproduce that in a smaller example yet.

As this behaviour is, IMHO, against the SIGSTOP concept, I tried to analyze
the kernel code responsible for it, but I could not find the exact
lines.

So here are my questions:

1. do you know any reason for which the SIGSTOP would not stop
immediately all threads of a process?

2. where do the threads get suspended exactly in the kernel? I think it
is in signal.c but I am not sure exactly where.

3. can you confirm that the bug MUST be in my code? :)

Thanks!

Best regards

Olivier


2005-05-04 18:17:46

by linux-os (Dick Johnson)

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, 4 May 2005, Olivier Croquette wrote:

> Hello
>
> On a 2.6.11 x86 system, I am SIGSTOP'ing processes which have started
> several threads before.
>
> As expected, all threads are suspended.
>
> But surprisingly, it can happen that some threads are still scheduled
> after the SIGSTOP has been issued.
>
> Typically, they get scheduled 2 times within the next 5ms, before being
> really stopped.
>
> Sadly, I could not reproduce that in a smaller example yet.
>
> As this behaviour is, IMHO, against the SIGSTOP concept, I tried to analyze
> the kernel code responsible for it, but I could not find the exact
> lines.
>
> So here are my questions:
>
> 1. do you know any reason for which the SIGSTOP would not stop
> immediately all threads of a process?
>
> 2. where do the threads get suspended exactly in the kernel? I think it
> is in signal.c but I am not sure exactly where.
>
> 3. can you confirm that the bug MUST be in my code? :)
>
> Thanks!
>
> Best regards
>
> Olivier


The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
a SIGSTOP and SIGCONT handler. These can be inherited by others
unless changed, perhaps by a 'C' runtime library. Basically,
the SIGSTOP handler executes pause() until the SIGCONT signal
is received.

Any delay in stopping is the time necessary for the signal to
be delivered. It is possible that the section of code that
contains the STOP/CONT handler was paged out and needs to be
paged in before the signal can be delivered.

You might quicken this up by installing your own handler for
SIGSTOP and SIGCONT....

static int stp;

static void contsig(int sig) // SIGCONT handler
{
    stp = 0;
}

static void stopsig(int sig) // SIGSTOP handler
{
    stp = 1;
    while(stp)
        pause();
}

Put this near the code that will be executing most of the time.



Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-05-04 19:10:47

by Alexander Nyberg

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

> On a 2.6.11 x86 system, I am SIGSTOP'ing processes which have started
> several threads before.
>
> As expected, all threads are suspended.
>
> But surprisingly, it can happen that some threads are still scheduled
> after the SIGSTOP has been issued.
>
> Typically, they get scheduled 2 times within the next 5ms, before being
> really stopped.
>
> Sadly, I could not reproduce that in a smaller example yet.
>
> As this behaviour is, IMHO, against the SIGSTOP concept, I tried to analyze
> the kernel code responsible for it, but I could not find the exact
> lines.
>
> So here are my questions:
>
> 1. do you know any reason for which the SIGSTOP would not stop
> immediately all threads of a process?

The following scenario is possible, for a process program1 with an
additional thread thread1:

1) You send SIGSTOP to program1.
2) thread1 is scheduled and runs.
3) program1 runs and, before it is scheduled off, notices that it has a
signal pending and makes sure all threads in the group get SIGSTOP set.
4) thread1 is scheduled and runs again. Before it is scheduled off, it
finds the signal pending and stops itself.

There are absolutely no guarantees about when a signal will be delivered.
Signals are delivered asynchronously.

> 2. where do the threads get suspended exactly in the kernel? I think it
> is in signal.c but I am not sure exactly where.

do_notify_resume()
  do_signal()
    get_signal_to_deliver()
      do_signal_stop()
        finish_stop()

> 3. can you confirm that the bug MUST be in my code? :)

You'll have to use reliable mechanisms to achieve what you're looking
for.

2005-05-04 19:16:17

by Daniel Jacobowitz

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
> The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
> a SIGSTOP and SIGCONT handler. These can be inherited by others
> unless changed, perhaps by a 'C' runtime library. Basically,
> the SIGSTOP handler executes pause() until the SIGCONT signal
> is received.
>
> Any delay in stopping is the time necessary for the signal to
> be delivered. It is possible that the section of code that
> contains the STOP/CONT handler was paged out and needs to be
> paged in before the signal can be delivered.
>
> You might quicken this up by installing your own handler for
> SIGSTOP and SIGCONT....

I don't know what RTOSes you've been working with recently, but none of
the above is true for Linux. I don't think it ever has been.

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-05-04 21:09:47

by Alex Riesen

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On 5/4/05, Daniel Jacobowitz wrote:
> On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
> > The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
> > a SIGSTOP and SIGCONT handler. These can be inherited by others
> > unless changed, perhaps by a 'C' runtime library. Basically,
> > the SIGSTOP handler executes pause() until the SIGCONT signal
> > is received.
> >
> > Any delay in stopping is the time necessary for the signal to
> > be delivered. It is possible that the section of code that
> > contains the STOP/CONT handler was paged out and needs to be
> > paged in before the signal can be delivered.
> >
> > You might quicken this up by installing your own handler for
> > SIGSTOP and SIGCONT....
>
> I don't know what RTOSes you've been working with recently, but none of
> the above is true for Linux. I don't think it ever has been.
>

I don't even think it was true for anything. It's his usual way of
saying things.

2005-05-05 00:34:00

by linux-os (Dick Johnson)

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, 4 May 2005, Daniel Jacobowitz wrote:

> On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
>> The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
>> a SIGSTOP and SIGCONT handler. These can be inherited by others
>> unless changed, perhaps by a 'C' runtime library. Basically,
>> the SIGSTOP handler executes pause() until the SIGCONT signal
>> is received.
>>
>> Any delay in stopping is the time necessary for the signal to
>> be delivered. It is possible that the section of code that
>> contains the STOP/CONT handler was paged out and needs to be
>> paged in before the signal can be delivered.
>>
>> You might quicken this up by installing your own handler for
>> SIGSTOP and SIGCONT....
>
> I don't know what RTOSes you've been working with recently, but none of
> the above is true for Linux. I don't think it ever has been.
>
> --
> Daniel Jacobowitz
> CodeSourcery, LLC
>

Grab a copy of your favorite init source. SIGSTOP and SIGCONT are
signals. They are handled by signal handlers, always have been
on Unix and Unix clones like Linux.

Cheers,
Dick Johnson

2005-05-05 00:42:26

by linux-os (Dick Johnson)

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, 4 May 2005, Alex Riesen wrote:

> On 5/4/05, Daniel Jacobowitz wrote:
>> On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
>>> The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
>>> a SIGSTOP and SIGCONT handler. These can be inherited by others
>>> unless changed, perhaps by a 'C' runtime library. Basically,
>>> the SIGSTOP handler executes pause() until the SIGCONT signal
>>> is received.
>>>
>>> Any delay in stopping is the time necessary for the signal to
>>> be delivered. It is possible that the section of code that
>>> contains the STOP/CONT handler was paged out and needs to be
>>> paged in before the signal can be delivered.
>>>
>>> You might quicken this up by installing your own handler for
>>> SIGSTOP and SIGCONT....
>>
>> I don't know what RTOSes you've been working with recently, but none of
>> the above is true for Linux. I don't think it ever has been.
>>
>
> I don't even think it was true for anything. It's his usual way of
> saying things.
>

Nope, I thought he was talking about the terminal stopper/starter,
SIGTSTP used for X-ON and X-OFF. I thought he was sending that signal,
timing it, then restarting with SIGCONT. You can't restart or
even trap a SIGSTOP signal.


Cheers,
Dick Johnson

2005-05-05 00:47:39

by linux-os (Dick Johnson)

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, 4 May 2005, linux-os (Dick Johnson) wrote:

> On Wed, 4 May 2005, Daniel Jacobowitz wrote:
>
>> On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
>>> The kernel doesn't do SIGSTOP or SIGCONT. Within init, there is
>>> a SIGSTOP and SIGCONT handler. These can be inherited by others
>>> unless changed, perhaps by a 'C' runtime library. Basically,
>>> the SIGSTOP handler executes pause() until the SIGCONT signal
>>> is received.
>>>
>>> Any delay in stopping is the time necessary for the signal to
>>> be delivered. It is possible that the section of code that
>>> contains the STOP/CONT handler was paged out and needs to be
>>> paged in before the signal can be delivered.
>>>
>>> You might quicken this up by installing your own handler for
>>> SIGSTOP and SIGCONT....
>>
>> I don't know what RTOSes you've been working with recently, but none of
>> the above is true for Linux. I don't think it ever has been.
>>
>> --
>> Daniel Jacobowitz
>> CodeSourcery, LLC
>>
>
> Grab a copy of your favorite init source. SIGSTOP and SIGCONT are
> signals. They are handled by signal handlers, always have been
> on Unix and Unix clones like Linux.
>

Sorry. I thought he was talking about SIGTSTP and SIGCONT, the
X-ON X-OFF signals. I thought he was sending a SIGTSTP signal
to a task, timing it, then continuing with SIGCONT. He said that
it didn't operate fast enough.

Cheers,
Dick Johnson

2005-05-05 01:04:41

by Andy Isaacson

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Wed, May 04, 2005 at 02:16:24PM -0400, Richard B. Johnson wrote:
> On Wed, 4 May 2005, Olivier Croquette wrote:
> >On a 2.6.11 x86 system, I am SIGSTOP'ing processes which have started
> >several threads before.
>
> The kernel doesn't do SIGSTOP or SIGCONT.

Dear Wrongbot,

No.

-andy

2005-05-05 12:25:37

by linux-os (Dick Johnson)

Subject: Re: Scheduler: SIGSTOP on multi threaded processes


I don't think the kernel handler gets a chance to do anything
because SYS-V init installs its own handler(s). There are comments
about Linux misbehavior in the code. It turns out that I was
right about SIGSTOP and SIGCONT...


Source-code header..... Current init version is 2.85 but I can't find
the source. This is 2.62

/*
* Init A System-V Init Clone.
*
* Usage: /sbin/init
* init [0123456SsQqAaBbCc]
* telinit [0123456SsQqAaBbCc]
*
* Version: @(#)init.c 2.62 29-May-1996 MvS
*
* This file is part of the sysvinit suite,

[SNIPPED...]

/*
 * Linux ignores all signals sent to init when the
 * SIG_DFL handler is installed. Therefore we must catch SIGTSTP
 * and SIGCONT, or else they won't work....
 *
 * The SIGCONT handler
 */
void cont_handler()
{
    got_cont = 1;
}

/*
 * The SIGSTOP & SIGTSTP handler
 */
void stop_handler()
{
    got_cont = 0;
    while(!got_cont) pause();
    got_cont = 0;
}


Now, if POSIX threads signals were implemented within the kernel,
without first purging the universe of all copies of the SYS-V init
that was distributed with early copies of RedHat and others (I don't
know about current copies; a very long search failed to find the
source), then whatever you do in the kernel is wasted.

Cheers,
Dick Johnson

2005-05-05 13:15:59

by Denis Vlasenko

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

On Thursday 05 May 2005 15:24, Richard B. Johnson wrote:
>
> I don't think the kernel handler gets a chance to do anything
> because SYS-V init installs its own handler(s). There are comments
> about Linux misbehavior in the code. It turns out that I was
> right about SIGSTOP and SIGCONT...

No you are not.
--
vda

2005-05-05 13:30:26

by Andreas Schwab

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

"Richard B. Johnson" <[email protected]> writes:

> I don't think the kernel handler gets a chance to do anything
> because SYS-V init installs its own handler(s).

It's impossible to install a handler for SIGSTOP.

> There are comments about Linux misbehavior in the code. It turns out
> that I was right about SIGSTOP and SIGCONT...

No, you are wrong. SIGTSTP != SIGSTOP.

Andreas.

--
Andreas Schwab, SuSE Labs
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2005-05-05 22:04:36

by Miquel van Smoorenburg

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

Richard B. Johnson wrote:
>
>I don't think the kernel handler gets a chance to do anything
>because SYS-V init installs its own handler(s). There are comments
>about Linux misbehavior in the code. It turns out that I was
>right about SIGSTOP and SIGCONT...

No, you're confused. Sysvinit catches SIGTSTP and SIGCONT (not SIGSTOP)
because pid #1 is special - unlike all other processes, SIG_DFL for
pid #1 is equal to SIG_IGN.

And remember - signal handlers are not inherited across exec (how
could they be..) so there is no such thing as "init installing a signal
handler for all processes".

Right now you should go out and buy a copy of the Stevens book,
"Advanced Programming in the UNIX Environment", and study it.

Mike.

2005-05-06 23:27:52

by Yuly Finkelberg

Subject: Problem while stopping many threads within a module

Hello -

I'm having a strange thread scheduling issue in a project that I'm
working on. We have a module with an interface that can be called by
many (currently 50) threads simultaneously. Threads that have entered
the kernel sleep on a wait queue until everyone else has entered. At
this point, a "master" process wakes up the first thread, which does
some work, then wakes up the second, etc. After waking up its
successor, each thread changes its state to STOPPED and sends itself a
SIGSTOP. Note that the threads are created with
CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND but NOT CLONE_THREAD, so
there is no group stop.

Basically, the structure is the following:
kernel_entry_point() {
    wait until it's your turn
    ...... do some work .... (serialized)
    wake up the next thread
    send SIGSTOP to yourself
}

At the same time, a monitoring process polls until all the threads
have stopped themselves:
monitor() {
repeat:
    for each thread
        if (thread->state < TASK_STOPPED)
            yield()
            goto repeat
}

Now, here's the problem. On 2.6.9 UP (Preempt), it is often the case
that one thread gets "stuck" in between the wake up of the next thread
and stopping itself -- this causes the monitor to poll for extended
periods of time, since the thread remains RUNNING. Strangely enough,
it generally gets unstuck by itself, sometimes within 10 seconds,
sometimes after as long as 10 minutes. When peeking at the kernel
stack of the offending process via the monitor, I only see that it is
in schedule and the stack looks like this:

c55e7ad0 00000086 c55e6000 c55e7a94 00000046 c55e6000 c55e7ad0 c0109c2d
00000000 c03ddae0 00000001 fd0b6c12 0013bc9f c6502130 001770fe fd478e5c
0013bc9f c55d546c c05d3960 00002710 c05d3960 c55e6000 c0106f25 c05d3960
Call Trace:
[<c0106f25>] need_resched+0x27/0x32

It also continues to be charged ticks, indicating that it's being
scheduled but is making no progress. However, I can't find anything
that this thread could be spinning on. Also, I don't understand why
there is no further context on the stack -- the thread does eventually
finish and never leaves the kernel, so the stack shouldn't be
corrupted... How can it finish if it has nowhere to return?

I realize that this is a long shot, but if anyone has any ideas, I'd
appreciate hearing them. Please let me know if I can provide any
further information.

Thanks,
-Yuly

2005-05-10 21:06:19

by Olivier Croquette

Subject: Re: Scheduler: SIGSTOP on multi threaded processes


Hi all


I worked on my problem over the last few days, and I came to these 2 main
questions:

- Can a SIGSTOP be in a pending state in Linux?

- If kill(pid, SIGSTOP) returns, does that mean that the corresponding
process is completely suspended?


I thought until now that SIGSTOP was so special that it could never be
pending, and that as soon as:
kill(pid, SIGSTOP)
returned, it was guaranteed that the corresponding process (and all
its threads) was suspended.

This would make sense in my opinion, but apparently it is not always the
case, and the POSIX standard does not say anything about that.

Any hint?


I also did some experiments with a program which fork()s into:

- a child which potentially starts threads and does some stuff

- a parent which regularly sends SIGSTOP to the child, checks whether the
activity really stopped, and then sends SIGCONT again

You will find the source code below.

I tried that with different scheduling policies (SCHED_OTHER and
SCHED_RR) and different numbers of threads:
- 0: no thread started (i.e. single-threaded child)
- 1: 1 thread started, and the main task just pthread_join()s it
- 2: 2 threads started, and the main task pthread_join()s them

I came to the following results:

Threads   SCHED_OTHER   SCHED_RR
0         OK            OK
1         FAIL          OK
2         FAIL          FAIL (1)


- the answers to my 2 questions (see above) seem to be No and Yes
respectively when no thread is started

- (1) For RR with 2 threads, there are 2 observed behaviours, apparently
occurring at random:

* either the parent always stops all threads instantaneously (like
when no thread is started), and keeps doing so for a long time

* or, right from the beginning, we can observe that the parent cannot do
that

I find this behaviour really strange.

Any idea?

Can one rely on the fact that the SIGSTOP operates instantaneously for
non-threaded applications?

Would it be possible to provide that for all applications?




#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>   /* kill(), SIGSTOP, SIGCONT, SIGKILL */
#include <sched.h>
#include <sys/time.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/shm.h>


#include <pthread.h>


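/* Set the scheduling policy and priority of a process; returns 0 on success. */
int set_process_sched(pid_t pid, int policy, int priority) {
    struct sched_param p;

    p.sched_priority = priority;

    /* Always apply the setting, even if the policy is unchanged,
       so that the priority gets updated as well. */
    if ( sched_setscheduler(pid, policy, &p) ) {
        perror("sched_setscheduler()");
        return 1;
    }

    return 0;
}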

/* Current wall-clock time in microseconds, or 0 on error. */
unsigned long long gettime(void) {

    struct timeval tv;

    if ( gettimeofday(&tv, NULL) ) {
        perror("gettimeofday()");
        return 0;
    }

    return (tv.tv_usec + tv.tv_sec * 1000000LL);
}

typedef struct {
    int thread_nb;       /* id defined by us */
    pthread_t thread_id; /* system id of the thread */
} thread_data;


int cont_main_loop = 1;


void sigterm_handler(int dummy) {
    printf("sigterm_handler\n");
    return;
}


/* We use a shared memory segment to communicate between the parent and
   the child. Only the first few bytes are used.
*/
int shmid;
/* volatile: the other process changes this memory behind our back */
volatile unsigned long long int *shared_array;
#define SHM_SIZE 1024

static inline void conf_shmem(void) {
    void *addr;

    shmid = shmget(IPC_PRIVATE, SHM_SIZE, 0666 | IPC_CREAT);
    if (shmid == -1) {
        perror("shmget()");
        exit(1);
    }

    /* shmat() returns (void *) -1 on failure, not NULL */
    addr = shmat(shmid, NULL, 0);
    if ( addr == (void *) -1 ) {
        perror("shmat()");
        exit(1);
    }
    shared_array = addr;
}


void loop(int marker) {
    unsigned long long int begin = gettime();
    /* run for 2 minutes at max
       (useful in case we end up with a busy loop in SCHED_RR... */
    while ( gettime() - begin < 120000000LL ) {
        /* write in the shared memory */
        shared_array[0] = marker;
    }
}

void *go_thread(void *dummy) {
    thread_data *data = (thread_data *) dummy;
    loop(data->thread_nb);
    fprintf(stderr,"%llu\tQuitting!\n",gettime());
    return NULL;
}


#define MAX_THREADS 100

int main(int argc, char **argv)
{
    pid_t pid;
    int test_failed = 0;
    unsigned long long exec_begin = gettime();
    int nb_threads = 0;


    conf_shmem();
    shared_array[0] = 0;

    if ( argc > 1 )
        nb_threads = atoi(argv[1]);
    if ( nb_threads < 0 )
        nb_threads = 0;
    if ( nb_threads > MAX_THREADS )
        nb_threads = MAX_THREADS;

    pid = fork();

    switch ( pid ) {

    case 0: /* child */
    {
        int thread;
        thread_data threads[MAX_THREADS];

        if ( nb_threads == 0 ) {
            /* no multi threading */
            loop(1);
            /* exit here so that the child never prints the report below */
            exit(0);
        }

        /* start the threads */
        for ( thread = 0 ; thread < nb_threads ; thread ++) {
            int err;

            threads[thread].thread_nb = thread + 1;
            err = pthread_create ( & threads[thread].thread_id,
                                   NULL,
                                   go_thread,
                                   (void *)&threads[thread]);
            /* pthread_create() returns the error code, it does not set errno */
            if ( err ) {
                fprintf(stderr, "pthread_create: error %d\n", err);
                exit(1);
            }
        }

        for ( thread = 0 ; thread < nb_threads ; thread ++) {
            pthread_join ( threads[thread].thread_id, NULL);
        }
        exit(0);
    }

    default: /* parent */
    {
        unsigned long long begin = gettime();

        /* depending whether we set the priorities or not,
           we get different results.
        */

        set_process_sched(0, SCHED_RR, 65);
        set_process_sched(pid, SCHED_RR, 60);


        /* run for 10s */
        while ( gettime() - begin < 10000000 ) {

            /* let the child run a little bit */
            usleep(1000);

            /* stop it */
            kill(pid, SIGSTOP);

            /* Reset our flag */
            shared_array[0] = 0;

            /* Wait to see if someone dares to overwrite our nice zero */
            usleep(1000);
            if ( shared_array[0] > 0 ) {
                test_failed = shared_array[0];
                break;
            }
            kill(pid, SIGCONT);
        }
        kill(pid, SIGKILL);
        /* reap the child so that it does not stay around as a zombie */
        waitpid(pid, NULL, 0);
        break;
    }

    case -1:
        perror("fork()");
        exit(1);
    }

    /* mark the shared memory segment for removal */
    shmctl(shmid, IPC_RMID, NULL);

    system("uname -a");
    printf("%d thread(s)\n",nb_threads);
    if ( ! test_failed )
        printf("test passed");
    else
        printf("test FAILED (%d)",test_failed);
    printf(" after %f s\n\n", ( gettime() - exec_begin) / 1000000.0 );

    return 0;
}


2005-05-10 21:15:15

by Roland McGrath

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

> - Can a SIGSTOP be in a pending state in Linux?

For short periods.

> - If kill(pid, SIGSTOP) returns, does that mean that the corresponding
> process is completely suspended?

No. One or more threads of the process may still be running on another CPU
momentarily before they process the interrupt and stop for the signal.

2005-05-10 23:05:29

by Alex Riesen

Subject: Re: Scheduler: SIGSTOP on multi threaded processes

This: http://www.opengroup.org/onlinepubs/009695399/toc.htm
and probably all other issues from the Open Group are very interesting reading.

2005-05-11 18:59:45

by Olivier Croquette

Subject: Re: Scheduler: SIGSTOP on multi threaded processes


Hello Roland

Thanks for your reply.

>>- Can a SIGSTOP be in a pending state in Linux?
>
> For short periods.
>
>>- If kill(pid, SIGSTOP) returns, does that mean that the corresponding
>>process is completely suspended?
>
> No. One or more threads of the process may still be running on another CPU
> momentarily before they process the interrupt and stop for the signal.


I sometimes get a 150ms delay between the end of kill() and the suspension
of the last of the 3 threads, on a single-CPU system (Pentium 4).

It seems understandable to me to have a delay of <=1ms, especially on SMP
systems, but I really can't understand:

- such large delays (like the 150ms)

- why only multi-threaded applications cause problems

- why the scheduling policy of the programs has an impact on the results

- why for some executions, the SIGSTOP effect is instantaneous 100s of
times in a row, until the end of the test, and the next execution shows
delays right from the beginning


I don't have much experience hacking the kernel, so these behaviours
are quite difficult for me to monitor or trace.
I am beginning to run out of ideas to test further :(

Could it be that my observations uncover a problem?
Or are they a consequence of the Linux implementation?
Or do I have a problem in my test bench?

Can anyone reproduce and/or validate these observations?

Any hint would be appreciated!