2004-11-22 16:30:28

by Ray Bryant

Subject: scalability of signal delivery for Posix Threads

We've encountered a scalability problem with signal delivery. Our application
is attempting to use ITIMER_PROF to deliver one signal per clock tick to each
thread of a pthreaded (NPTL) program. These threads are created with CLONE_SIGHAND,
so that there is a single sighand->siglock for the entire application.

On our Altix systems, everything works fine until we increase the number of
threads (and processors, with one thread bound to each processor) beyond
about 112 threads or so. At that point lock contention can become severe
enough to make the system unresponsive. The reason is that each thread has
to obtain sighand->siglock, which in this case is global to the entire application.

(Obviously, one solution is to recode the application to send fewer signals
per thread as the number of threads increases. However, we are concerned by
the fact that a user application, of any kind, can be constructed in a way
that causes the system to become unresponsive, and we would like to find a solution
that would let us correctly execute the program as described.)

Perusing the kernel sources shows that this global lock is used, in many
cases, to protect data that is purely local to the current thread. (For
example, see block_all_signals(), where current->sighand->siglock is obtained
and then the current task's signal mask is manipulated.)

This lock is also acquired in the following routines, where most of the time
in our application is being spent:

ia64_do_signal
get_signal_to_deliver
ia64_rt_sigreturn

In these cases, it appears that the global lock is being acquired to make sure
the signal handler definition doesn't change underneath us, as well as dealing
with the per thread signal data.

Since signals are sent much more often than sigaction() is called, it would
seem to make more sense to make sigaction() take a heavier weight lock of
some kind (to update the signal handler description) and to have the signal
delivery mechanism take a lighter weight lock. Making
current->sighand->siglock a rwlock_t really doesn't improve the situation
much, since cache line contention is just as severe in that case (if not worse)
as it is with the current definition.

It seems to me that scalability would be improved if we moved the siglock from
the sighand structure to the task_struct. (keep reading, please...) Code
that manipulates only the current task's signal data would just obtain that
lock. Code that needs to change the sighand structure (e.g. sigaction())
would obtain all of the siglocks of all tasks using the same sighand
structure. A list of those task_structs would be added to the sighand
structure to enable finding these structures without having to take the
tasklist_lock and search for them.
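
To make the intended split concrete, here is a rough userspace analogue
(purely illustrative; the structure and function names below are made up and
this is not kernel code): signal delivery takes only the target's own lock,
while the sigaction()-style update takes every per-task lock before touching
the shared handler description:

/*
 * Userspace sketch of the proposed scheme (names are made up, not kernel
 * code).  Each "task" owns a private lock for its signal state; the rare
 * sigaction()-style update takes every per-task lock, in list order, before
 * changing the shared handler table.
 */
#include <pthread.h>
#include <stdio.h>

#define NTASKS 4

struct task {
    pthread_mutex_t siglock;        /* per-task lock (the proposal) */
    int pending;                    /* stand-in for per-thread signal state */
};

struct sighand {
    struct task *tasks[NTASKS];     /* tasks sharing this sighand */
    int action;                     /* stand-in for the handler table */
};

/* Delivery path: touch only the target task's lock. */
static void deliver(struct task *t)
{
    pthread_mutex_lock(&t->siglock);
    t->pending++;
    pthread_mutex_unlock(&t->siglock);
}

/* sigaction() path: lock every sharer, always in the same (list) order. */
static void change_action(struct sighand *sh, int new_action)
{
    int i;

    for (i = 0; i < NTASKS; i++)
        pthread_mutex_lock(&sh->tasks[i]->siglock);
    sh->action = new_action;
    for (i = NTASKS - 1; i >= 0; i--)
        pthread_mutex_unlock(&sh->tasks[i]->siglock);
}

int main(void)
{
    static struct task tasks[NTASKS];
    struct sighand sh;
    int i;

    sh.action = 0;
    for (i = 0; i < NTASKS; i++) {
        pthread_mutex_init(&tasks[i].siglock, NULL);
        sh.tasks[i] = &tasks[i];
    }
    deliver(&tasks[2]);
    change_action(&sh, 1);
    printf("pending[2]=%d action=%d\n", tasks[2].pending, sh.action);
    return 0;
}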

Obviously, this change ricochets throughout the entire signal handling code.
It also means that sigaction() can become quite expensive for a heavily threaded
POSIX application, but my guess is that this doesn't happen very often.
The change could also make do_exit(), thread group shutdown, etc. slower and
perhaps somewhat more complex.

Anyway, we would be interested in the community's ideas about dealing with
this signal delivery scalability issue; comments on the solution above
or suggestions for alternative solutions are welcome.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------


2004-11-22 16:58:25

by Andi Kleen

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
> Andi Kleen <[email protected]> writes:
>
> > At least in traditional signal semantics you have to call sigaction
> > or signal in each signal handler to reset the signal. So that
> > assumption is not necessarily true.
>
> If you use sigaction then you get POSIX semantics, which don't have this
> problem.

It's just a common case where Ray's assumption is not true.

-Andi

2004-11-22 17:29:19

by Philip Mucci

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

Hi Andi,

Allow me to say that this particular case is very special, because
ITIMER_PROF is used in many performance tools. Inside of PAPI for
example, we use ITIMER_PROF for both counter multiplexing and for
statistical profiling. Tools such as HPCToolkit, PerfSuite and others
often enable ITIMER_PROF at the highest available resolution. So while
there may be hundreds or thousands of ways to do this, this particular
avenue has many useful tools out there that can easily trigger this
case.

However, I think your suggestion is an excellent one regarding a fast
path for ITIMER_PROF.

FWIW, Solaris has ITIMER_REALPROF, which, when enabled, sends
SIGPROF to all LWPs in the process on each tick. In this way,
each thread doesn't have to call setitimer()
by itself to get a signal...yeah yeah, I know what POSIX says about
signal delivery to any LWP, but on every Linux I've tested, the
ITIMER_PROF signal only goes to the thread that registered the timer. Granted,
I haven't run the test on an Altix.

Regards,

Philip Mucci

> I suspect there are hundreds or thousands of ways on such a big system to
> exploit some lock to make the system unresponsive. If you wanted
> to fix them all you would be in a large-scale redesign effort.
> It's not clear why this particular case is special.

> > Since signals are sent much more often than sigaction() is called, it would
> > seem to make more sense to make sigaction() take a heavier weight lock of
>
> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.
>
> > It seems to me that scalability would be improved if we moved the siglock
> > from
> > the sighand structure to the task_struct. (keep reading, please...) Code
> > that manipulates the current task signal data only would just obtain that
> > lock. Code that needs to change the sighand structure (e. g. sigaction())
> > would obtain all of the siglocks of all tasks using the same sighand
> > structure. A list of those task_structs would be added to the sighand
> > structure to enable finding these structures without having to take the
> > tasklist_lock and search for them.
>
> Taking all these locks without risking deadlock would be tricky.
> You could just use a ring, but would need to point to a common
> anchor and always start from there to make sure all lock grabbers
> acquire the locks in the same order.
>
> > Anyway, we would be interested in the community's ideas about dealing with
> > this signal delivery scalability issue, and, comments on the solution above
> > or suggestions for alternative solutions are welcome.
>
> How about you figure out a fast path of some signals that can work
> without locking: e.g. no load balancing needed, no queued signal, etc.
> and then just do the delivery of SIGPROF lockless? Or just ignore it
> since the original premise doesn't seem too useful.
>
> -Andi

2004-11-22 17:26:01

by Robin Holt

Subject: Re: scalability of signal delivery for Posix Threads

On Mon, Nov 22, 2004 at 09:51:15AM -0600, Ray Bryant wrote:
> We've encountered a scalability problem with signal delivery. Our
> application
> is attempting to use ITIMER_PROF to deliver one signal per clock tick to
> each
> thread of a pthreaded (NPTL) program. These threads are created with CLONE_SIGHAND,
> so that there is a single sighand->siglock for the entire application.

Ray, can you provide a simple example application that trips this case?

2004-11-22 19:30:40

by Ray Bryant

Subject: Re: scalability of signal delivery for Posix Threads

Robin Holt wrote:
> On Mon, Nov 22, 2004 at 09:51:15AM -0600, Ray Bryant wrote:
>
>>We've encountered a scalability problem with signal delivery. Our
>>application
>>is attempting to use ITIMER_PROF to deliver one signal per clock tick to
>>each
>>thread of a pthreaded (NPTL) program. These threads are created with CLONE_SIGHAND,
>>so that there is a single sighand->siglock for the entire application.
>
>
> Ray, can you provide a simple example application that trips this case?

I'll do that, but it may be after the Thanksgiving holiday when I get it
out to the mailing list. It's a minor thing, but the test we are using
now is an OpenMP application (written in C) and I'd propose rewriting
it using POSIX threads without OpenMP for a more general audience.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-11-22 19:57:37

by Ray Bryant

Subject: Re: [Lse-tech] Re: scalability of signal delivery for Posix Threads

Matthew Wilcox wrote:
> On Mon, Nov 22, 2004 at 09:51:15AM -0600, Ray Bryant wrote:
>
>>Since signals are sent much more often than sigaction() is called, it would
>>seem to make more sense to make sigaction() take a heavier weight lock of
>>some kind (to update the signal handler description) and to have the signal
>>delivery mechanism take a lighter weight lock. Making
>>current->sighand->siglock a rwlock_t really doesn't improve the situation
>>much, since cache line contention is just as severe in that case (if not
>>worse) than it is with the current definition.
>
>
> What about RCU or seqlock?
>

Well, the sighand->siglock is taken in so many places in the kernel (>200 times)
that RCUing its usage looks like a daunting change to make.

In principle, I guess a seqlock could be made to work. The idea would be that
we'd want to get a consistent copy of the sighand structure in the presence
of very rare updates. Once again, I'd have to modify all of those code
paths mentioned above.

Since a seqlock was created, AFAIK, as an alternative to a brlock, and the
global/local locking structure I described before is more or less equivalent
to a brlock, I think we are thinking along similar lines.
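
For concreteness, the read/retry shape I have in mind looks something like
this toy userspace version (just a sketch; the kernel's seqlock_t is the real
implementation and its barriers are chosen far more carefully):

/*
 * Toy userspace seqlock, only to illustrate the read/retry ("failure")
 * pattern being discussed.  Writers are serialized by a mutex and bump the
 * sequence to odd while updating; readers retry if the sequence changed
 * underneath them.
 */
#include <pthread.h>
#include <stdio.h>

struct seqlock {
    volatile unsigned seq;                      /* odd while a write is in progress */
    pthread_mutex_t wlock;                      /* serializes writers */
};

static struct seqlock sl = { 0, PTHREAD_MUTEX_INITIALIZER };
static int shared_action;                       /* stand-in for the handler description */

static unsigned read_begin(void)
{
    unsigned s;

    while ((s = sl.seq) & 1)                    /* writer active: wait */
        ;
    __atomic_thread_fence(__ATOMIC_ACQUIRE);    /* roughly smp_rmb() */
    return s;
}

static int read_retry(unsigned start)
{
    __atomic_thread_fence(__ATOMIC_ACQUIRE);    /* roughly smp_rmb() */
    return sl.seq != start;
}

static void write_begin(void)
{
    pthread_mutex_lock(&sl.wlock);
    sl.seq++;                                   /* now odd */
    __atomic_thread_fence(__ATOMIC_RELEASE);    /* roughly smp_wmb() */
}

static void write_end(void)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);    /* roughly smp_wmb() */
    sl.seq++;                                   /* even again */
    pthread_mutex_unlock(&sl.wlock);
}

int main(void)
{
    int copy;
    unsigned s;

    write_begin();                              /* sigaction()-style update */
    shared_action = 42;
    write_end();

    do {                                        /* delivery-style read, retry on change */
        s = read_begin();
        copy = shared_action;
    } while (read_retry(s));

    printf("action=%d\n", copy);
    return 0;
}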

For me, converting spin_lock_irqsave(&p->sighand->siglock) to
spin_lock_irqsave(&p->siglock) and then checking to make sure that
only task-local data is updated in the critical section is an easier
way to go than modifying each of the code paths to deal with the
"failure" case for a seqlock. But I could be proven wrong.

Anyway, Andi makes a good point -- if I can fast-path SIGPROF handling,
then I may have a more localized change, and that is a good thing [tm].

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-11-22 19:57:37

by Andi Kleen

Subject: Re: [Lse-tech] Re: scalability of signal delivery for Posix Threads

> Well, the sighand->siglock is taken in so many places in the kernel (>200 times)
> that RCUing its usage looks like a daunting change to make.

Agreed. And having to wait for all CPUs in sigaction would also not
be nice.

>
> In principle, I guess a seqlock could be made to work. The idea would be

seqlocks only let the readers go lockless, but for signal delivery you need
a writer to update state like the thread load balancing. We got all that gunk
from POSIX; before NPTL it would probably have been possible ;-)

-Andi

2004-11-22 20:30:42

by Ray Bryant

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

OK, apparently SIGPROF is delivered in both the ITIMER_PROF and
PMU interrupt cases, so if we special-case that signal we should
be fine.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-11-22 21:34:10

by Rick Lindsley

[permalink] [raw]
Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

So with CLONE_SIGHAND, we share the handler assignments and which signals
are blocked, but retain the ability for individual threads to receive
a signal. And when all of them receive signals in quick succession,
we see lock contention because they're sharing the same (effectively)
global lock to receive all of their (effectively) individual signals ...
is that correct?

Are you contending on tasklist_lock, or on siglock?

    It seems to me that scalability would be improved if we moved the
    siglock from the sighand structure to the task_struct.

Only if you want to keep its current semantics of it being a lock for
all things signal. Finer granularity would, it seems at first look,
afford you the benefits you're looking for. (But not without the cost of
a fair amount of work to make sure the new locks are utilized correctly.)
For the problem you're describing, it sounds like the contention is occurring
at delivery, so a new lock for pending, blocked, and real_blocked might be
in order.

Rick

2004-11-22 23:41:44

by Ray Bryant

[permalink] [raw]
Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

Rick Lindsley wrote:
> So with CLONE_SIGHAND, we share the handler assignments and which signals
> are blocked, but retain the ability for individual threads to receive
> a signal. And when all of them receive signals in quick succession,
> we see lock contention because they're sharing the same (effectively)
> global lock to receive all of their (effectively) individual signals ...
> is that correct?
>

Yes, I think that's what's happening, except that I think the blocked
signal list is per thread as well. The shared sighand structure just
has the saved arguments from sigaction, as I remember. (It's confusing:
the set of signals blocked during execution of the signal handler is
part of the sigaction structure and hence is global to the entire
thread group, whilst the set of signals blocked in general is per thread.)
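
In userspace terms the distinction is roughly this (a minimal sketch, nothing
more): the sa_mask passed to sigaction() is part of the shared handler
description and applies while the handler runs, for every thread, while
pthread_sigmask() changes only the calling thread's own blocked set:

/*
 * Minimal sketch of the distinction above (illustrative only): sa_mask is
 * part of the shared, process-wide handler description; pthread_sigmask()
 * manipulates the calling thread's own blocked set.
 */
#include <pthread.h>
#include <signal.h>
#include <string.h>

static void handler(int signo)
{
    /* SIGPROF is blocked here via sa_mask, for whichever thread runs this */
}

static void *worker(void *arg)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* affects this thread only */
    return NULL;
}

int main(void)
{
    struct sigaction sa;
    pthread_t t;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGPROF);          /* shared: blocked during handler */
    sigaction(SIGUSR1, &sa, NULL);

    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}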

> Are you contending on tasklist_lock, or on siglock?

Definitely: siglock. All of the profiling ticks occur at
spin_unlock_irqrestore(&p->sighand->siglock) in the routines I
mentioned before. [we don't have NMI profiling on Altix...
so profiling typically can't look inside
of code sections with interrupts suspended.]

>
> It seems to me that scalability would be improved if we moved the
> siglock from the sighand structure to the task_struct.
>
> Only if you want to keep its current semantics of it being a lock for
> all things signal. Finer granularity would, it seems at first look,
> afford you the benefits you're looking for. (But not without the cost of
> a fair amount of work to make sure the new locks are utilized correctly.)
> For the problem you're describing, it sounds like the contention is occurring
> at delivery, so a new lock for pending, blocked, and real_blocked might be
> in order.
>
> Rick
>

Yes, I was hoping to keep the current semantics of siglock as the lock for
all things signal, just make it local per thread, and require that all of the
siglocks be held to change the sighand structure. That seemed like a change I
could manage. My personal notion was that the slowdown of sigaction()
processing for multi-threaded POSIX programs was not that big of a deal because
it doesn't happen very often, and for non-CLONE_SIGHAND threads the additional
cost would be minor. But if the slowdown in the CLONE_SIGHAND case is not
acceptable, then I'm stuck as to how to do this.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-11-22 19:00:45

by Ray Bryant

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

Andi Kleen wrote:
> On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
>
>>Andi Kleen <[email protected]> writes:
>>
>>
>>>At least in traditional signal semantics you have to call sigaction
>>>or signal in each signal handler to reset the signal. So that
>>>assumption is not necessarily true.
>>
>>If you use sigaction then you get POSIX semantics, which don't have this
>>problem.
>
>
> It's just a common case where Ray's assumption is not true.
>
> -Andi
>

True enough. And in that case the design that I was describing wouldn't
make sigaction() that much more expensive, since if you are not in the POSIX
thread environment (more precisely, if the thread was not created with
CLONE_SIGHAND), each thread has its own sighand structure and the "global"
locking mechanism I had proposed would only require the taking of one
additional lock.

However, special-casing ITIMER_PROF is also a reasonable avenue of approach.
The performance monitor code can also deliver signals to user space when
a sampling buffer overflows, and this can have the same kind of scaling
problem as ITIMER_PROF. I'll have to do a little research to figure out
exactly how that works, but that signal (SIGIO?) would also be a candidate
for special-casing on our platform.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-11-22 16:55:37

by Andi Kleen

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

On Mon, Nov 22, 2004 at 09:51:15AM -0600, Ray Bryant wrote:
> (Obviously, one solution is to recode the application to send fewer signals
> per thread as the number of threads increases. However, we are concerned by
> the fact that a user application, of any kind, can be constructed in a way
> that causes the system to become unresponsive and would like to find a solution
> that would let us correctly execute the program as described.)

I suspect there are hundreds or thousands of ways on such a big system to
exploit some lock to make the system unresponsive. If you wanted
to fix them all you would be in a large-scale redesign effort.
It's not clear why this particular case is special.

> Since signals are sent much more often than sigaction() is called, it would
> seem to make more sense to make sigaction() take a heavier weight lock of

At least in traditional signal semantics you have to call sigaction
or signal in each signal handler to reset the signal. So that
assumption is not necessarily true.

> It seems to me that scalability would be improved if we moved the siglock
> from
> the sighand structure to the task_struct. (keep reading, please...) Code
> that manipulates the current task signal data only would just obtain that
> lock. Code that needs to change the sighand structure (e. g. sigaction())
> would obtain all of the siglocks of all tasks using the same sighand
> structure. A list of those task_structs would be added to the sighand
> structure to enable finding these structures without having to take the
> tasklist_lock and search for them.

Taking all these locks without risking deadlock would be tricky.
You could just use a ring, but would need to point to a common
anchor and always start from there to make sure all lock grabbers
acquire the locks in the same order.
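
Something like the following, purely as a sketch (the names are made up):
keep the sharers on a ring whose anchor lives in the shared sighand, and make
every all-locks taker start its walk at that anchor, so the acquisition order
is the same for everyone:

/*
 * Sketch of the ring-with-anchor idea (names are made up, not kernel code):
 * every "task" sharing a sighand sits on a circular list whose fixed anchor
 * is held in the sighand, and anyone who needs all the per-task locks walks
 * the ring from that anchor, so all lockers acquire in the same order.
 */
#include <pthread.h>

#define NTASKS 3

struct task {
    pthread_mutex_t siglock;
    struct task *next;                /* circular list through the sharers */
};

struct sighand {
    struct task *anchor;              /* common starting point for everyone */
};

static void lock_all(struct sighand *sh)
{
    struct task *t = sh->anchor;

    do {
        pthread_mutex_lock(&t->siglock);
        t = t->next;
    } while (t != sh->anchor);
}

static void unlock_all(struct sighand *sh)
{
    struct task *t = sh->anchor;

    do {
        pthread_mutex_unlock(&t->siglock);
        t = t->next;
    } while (t != sh->anchor);
}

int main(void)
{
    static struct task tasks[NTASKS];
    struct sighand sh;
    int i;

    for (i = 0; i < NTASKS; i++) {
        pthread_mutex_init(&tasks[i].siglock, NULL);
        tasks[i].next = &tasks[(i + 1) % NTASKS];
    }
    sh.anchor = &tasks[0];

    lock_all(&sh);                    /* e.g. around a sigaction()-style update */
    unlock_all(&sh);
    return 0;
}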

> Anyway, we would be interested in the community's ideas about dealing with
> this signal delivery scalability issue, and, comments on the solution above
> or suggestions for alternative solutions are welcome.

How about you figure out a fast path of some signals that can work
without locking: e.g. no load balancing needed, no queued signal, etc.
and then just do the delivery of SIGPROF lockless? Or just ignore it
since the original premise doesn't seem too useful.

-Andi

2004-11-22 16:55:35

by Andreas Schwab

Subject: Re: [Lse-tech] scalability of signal delivery for Posix Threads

Andi Kleen <[email protected]> writes:

> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.

If you use sigaction then you get POSIX semantics, which don't have this
problem.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2004-11-22 16:41:00

by Matthew Wilcox

Subject: Re: scalability of signal delivery for Posix Threads

On Mon, Nov 22, 2004 at 09:51:15AM -0600, Ray Bryant wrote:
> Since signals are sent much more often than sigaction() is called, it would
> seem to make more sense to make sigaction() take a heavier weight lock of
> some kind (to update the signal handler description) and to have the signal
> delivery mechanism take a lighter weight lock. Making
> current->sighand->siglock a rwlock_t really doesn't improve the situation
> much, since cache line contention is just as severe in that case (if not
> worse) than it is with the current definition.

What about RCU or seqlock?

--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain

2004-11-23 20:53:40

by Ray Bryant

Subject: Re: scalability of signal delivery for Posix Threads


#define _GNU_SOURCE

#include <sys/syscall.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <signal.h>
#include <errno.h>

#define NITER 100000000
#define perrorx(s) (perror(s), exit(1))
#define MAX_THREADS 512
#define CACHE_LINE_SIZE 128
#define __cache_aligned __attribute__((__aligned__(CACHE_LINE_SIZE)))

static struct itimerval itv;

/* this makes count a per-thread variable */
static __thread volatile unsigned long count;

pthread_t ptid[MAX_THREADS];

struct child_state_struct {
    int state;
    pid_t tid;
    long start_count;
    long end_count;
    double result;
    char filler[CACHE_LINE_SIZE - (sizeof(int)+sizeof(pid_t)+2*sizeof(long)+sizeof(double))];
};
struct child_state_struct child_state[MAX_THREADS] __cache_aligned;

volatile int go __cache_aligned = 0;
char filler[CACHE_LINE_SIZE - (sizeof(int))];

static void
sigprof_handler(int signo, siginfo_t *sip, void *scp)
{
    ++count;    /* one tick for this thread */
}

void *func(void *arg)
{
    int i, id = (int)(long)arg;
    double a = 0.1;
    pid_t tid;

    tid = syscall(__NR_gettid);
    child_state[id].tid = tid;
    child_state[id].state = 1;

    /* each thread arms its own ITIMER_PROF */
    if (setitimer(ITIMER_PROF, &itv, 0) < 0) {
        fprintf(stderr, "Setitimer failed: %s\n", strerror(errno));
        exit(1);
    }

    while (go < 1)
        ;                       /* wait for main() to start the run */

    child_state[id].start_count = count;
    for (i = 0; i < NITER; ++i)
        a = cos(asin(a));
    child_state[id].end_count = count;
    child_state[id].result = a;

    child_state[id].state = 2;

    while (go < 2)
        ;                       /* spin until the process exits */

    return NULL;
}

int
main(int argc, char *argv[])
{
    struct sigaction sa;
    int j, started, finished, thread_count;

    if (argc < 2) {
        printf("Call as '%s NN', where NN is the number of threads to create.\n", argv[0]);
        exit(1);
    }

    thread_count = atol(argv[1]);
    if (thread_count < 1 || thread_count > MAX_THREADS) {
        fprintf(stderr, "Thread count must be between 1 and %d.\n", MAX_THREADS);
        exit(1);
    }
    printf("Creating %d threads.\n", thread_count);

    /* one SIGPROF per clock tick */
    itv.it_value.tv_sec = 0;
    itv.it_value.tv_usec = 1000000.0 / (double)sysconf(_SC_CLK_TCK);
    itv.it_interval = itv.it_value;

    memset(&sa, 0, sizeof(sa));
    sigfillset(&sa.sa_mask);
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;

    if (sigaction(SIGPROF, &sa, 0) < 0) {
        fprintf(stderr, "Sigaction failed: %s\n", strerror(errno));
        exit(1);
    }

    for (j = 0; j < thread_count; j++)
        if (pthread_create(&ptid[j], NULL, &func, (void *)(long) j) != 0)
            perrorx("pthread create");

    do {
        sleep(1);
        started = 0;
        for (j = 0; j < thread_count; j++)
            started += child_state[j].state;
    } while (started < thread_count);

    printf("All threads have started.\n");
    go++;

    do {
        sleep(1);
        finished = 0;
        for (j = 0; j < thread_count; j++)
            finished += (child_state[j].state == 2);
    } while (finished < thread_count);
    printf("All threads have finished.\n");

    for (j = 0; j < thread_count; j++)
        printf("count[%d]=%ld\n",
               j, child_state[j].end_count - child_state[j].start_count);

    return 0;
}
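
For reference, the test builds and runs with something like the following
(the output name here is arbitrary; adjust flags to taste):

cc -O2 -o itimer-test y.c -lpthread -lm
./itimer-test 8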


Attachments:
y.c (2.88 kB)