This concerns a scheduler bug that we've been looking at in 2.6.27.8 on
ia64 Altix machines.
Below is the current testcase that we're using to reproduce this problem.
The test should be run as follows:
taskset -c <cpu> t -c <another cpu> -n <non-zero nice level>
It creates 5 pthreads. Each of those threads sets its affinity to the
cpu passed in, sets its nice value to the level passed in, and counts
in a loop.
Depending on what cpus are chosen for the pair of cpus passed into taskset
and the test itself, output will look something like the following:
# taskset -c 3 t -c 2 -n -5
Create thread 0
Create thread 1
slave 0: pid 4086 cpu 2 nlevel -5
Create thread 2
slave 1: pid 4087 cpu 2 nlevel -5
Create thread 3
slave 2: pid 4088 cpu 2 nlevel -5
Create thread 4
slave 3: pid 4089 cpu 2 nlevel -5
slave 4: pid 4090 cpu 2 nlevel -5
Wait ready
GO
Counts:
0: 47996540
1: 47548033
2: 3419
3: 3249
4: 67665404
Counts:
0: 96877777
1: 95094916
2: 3419
3: 3249
4: 116545241
Note that threads 2 and 3 have stopped running. They will eventually run
again, but the delay can be extremely long, depending on the pair of cpus
chosen. The test usually fails with the same pattern when the same pair of
cpus is chosen; it behaves differently for different cpu pairs, and runs
fine for certain pairs.
After some examination we've found the overall reason for this. The
runqueue rb tree is sorted by the vruntime value associated with each
thread. These vruntime values can occasionally get set far above the
other threads' vruntimes, leaving the affected threads to starve while
they wait to be moved up in the runqueue.
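For reference, the insertion path in kernel/sched_fair.c keys entities
roughly like this (a paraphrased sketch of the 2.6.27 code, not a
verbatim quote):

    /* Entities sort left to right by vruntime relative to the
     * runqueue's min_vruntime; the leftmost entity runs next, so a
     * thread whose vruntime jumps far ahead lands at the far right
     * of the tree and waits behind everyone else. */
    s64 key = se->vruntime - cfs_rq->min_vruntime;
    if (key < entity_key(cfs_rq, entry))
        link = &parent->rb_left;        /* runs sooner */
    else
        link = &parent->rb_right;       /* runs later */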
One place we've found this happens is in update_curr(), which calculates a
delta_exec value as follows:
delta_exec = (unsigned long)(now - curr->exec_start);
Sometimes this value will be very large: when 'now' (the rq clock time)
is less than 'exec_start', the unsigned subtraction wraps around to a
huge positive number. When this happens, __update_curr() will calculate
a delta_exec_weighted based on this huge value and add it to the
thread's vruntime:
curr->vruntime += delta_exec_weighted;
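To make the wraparound concrete, here is a minimal userspace
demonstration (the timestamp values are invented; only the cast and the
subtraction mirror the kernel code, and it assumes a 64-bit unsigned
long as on ia64):

#include <stdio.h>

int main(void)
{
    /* Simulated clock readings: exec_start was stamped by a cpu whose
     * sched_clock() runs 100ns ahead of the cpu reading 'now'. */
    unsigned long long exec_start = 1000000100ULL;
    unsigned long long now = 1000000000ULL;

    /* The unsigned subtraction wraps to a value near 2^64 instead of
     * producing a small negative delta. */
    unsigned long delta_exec = (unsigned long)(now - exec_start);

    printf("delta_exec = %lu\n", delta_exec);   /* 18446744073709551516 */
    return 0;
}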
For this particular case, we've found that adding a hack that clamps
exec_start, so that the delta_exec calculation can never wrap, allows the
above testcase to run:
+    if (now < curr->exec_start)
+        curr->exec_start = now - 1;
     delta_exec = (unsigned long)(now - curr->exec_start);
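An equivalent guard, sketched here purely for comparison (not the patch
we actually ran), would clamp the computed delta instead of rewriting
exec_start:

     delta_exec = (unsigned long)(now - curr->exec_start);
+    /* Sketch only: treat a backwards clock as zero elapsed time. */
+    if ((s64)delta_exec < 0)
+        delta_exec = 0;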
I'm wondering if anyone else has run into this problem, and whether they've
found a resolution for it?
/*
* gcc -o jtest jtest.c -lpthread
* jtest -c8 -v
*/
#define _GNU_SOURCE 1
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#define NTHR 5
#define perrorx(s) do {perror(s); exit(1);} while (0)
extern int optind, opterr;
extern char *optarg;
static void *slave(void *arg);
volatile int ready, go = 0;
struct slave_s {
    int cpu;
    int id;
    int nlevel;
} slave_t[NTHR];
unsigned long x[NTHR];
static int runonx(int cpu)
{
    cpu_set_t mask;

    if (cpu < 0 || cpu >= 1024)
        return -1;
    memset(&mask, 0, sizeof(mask));
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
        perrorx("setaffinity failed");
    return 0;
}
int main(int argc, char **argv)
{
    int c, th, k = 0, er = 0;
    int cpu = 1;
    int nlevel = 0;
    unsigned loops = (unsigned)-1;
    unsigned long i = 0;
    static char optstr[] = "c:l:n:v";
    pthread_t *threads;

    opterr = 1;
    while ((c = getopt(argc, argv, optstr)) != EOF)
        switch (c) {
        case 'c':
            cpu = atoi(optarg);
            break;
        case 'l':
            loops = atoi(optarg);
            break;
        case 'n':
            nlevel = atoi(optarg);
            break;
        case '?':
            er = 1;
            break;
        }
    if (er)
        exit(1);

    sleep(1);
    threads = malloc(sizeof(pthread_t) * NTHR);
    for (th = 0; th < NTHR; th++) {
        printf("Create thread %d\n", th);
        slave_t[th].cpu = cpu;
        slave_t[th].id = th;
        slave_t[th].nlevel = nlevel;
        if (pthread_create(&threads[th], NULL, slave, (void *)&slave_t[th]) != 0)
            perrorx("pthread_create failed:");
    }

    runonx(cpu);
    if (1 || nlevel) {
        errno = 0;
        if (nice(nlevel) == -1 && errno) {
            perror("nice");
            exit(-1);
        }
    }

    printf("Wait ready\n");
    while (ready < NTHR)
        usleep(10000);
    printf("GO\n");

    while (k < loops) {
        i++;
        if (i % 10000000 == 0) {
            int j;

            printf("Counts:\n");
            for (j = 0; j < NTHR; j++)
                printf("%d: %lu\n", j, x[j]);
            printf("\n");
            k++;
        }
    }
    go++;

    for (th = 0; th < NTHR; th++)
        pthread_join(threads[th], NULL);
    printf("Master Done\n");
    return 0;
}
static void *slave(void *arg)
{
    struct slave_s *s = (struct slave_s *)arg;

    runonx(s->cpu);
    if (1 || s->nlevel) {
        errno = 0;
        if (nice(s->nlevel) == -1 && errno) {
            perror("nice");
            exit(-1);
        }
    }
    printf("slave %d: pid %ld cpu %d nlevel %d\n",
           s->id, syscall(SYS_gettid), s->cpu, s->nlevel);
    (void)__sync_fetch_and_add(&ready, 1);
    while (!go) {
        x[s->id]++;
    }
    printf("slave %d done\n", s->id);
    pthread_exit(0);
}
Dimitri Sivanich wrote:
>
> I'm wondering if anyone else has run into this problem, and whether they've
> found a resolution for it?
>
>
Hi Dimitri,
I have observed similar problems on my opensuse 11.0 box (2.6.25
based) when doing kernel builds. Oftentimes other apps will get
starved (like firefox, etc). I was talking to peterz about it and he
pointed me at a commit that went into 28-rcx that fixes this (the sha
escapes me at the moment). I've been meaning to pull the patch into our
2.6.25 and 2.6.27 trees to verify if it fixes it, but I have fairly high
confidence it will. Perhaps Peter can point you at the same patch and
you can give it a whirl. For the time being, can you see if 28 is fixed?
-Greg
Hi Greg,
On Mon, Jan 05, 2009 at 01:44:41PM -0500, Gregory Haskins wrote:
> Dimitri Sivanich wrote:
> >
> > I'm wondering if anyone else has run into this problem, and whether they've
> > found a resolution for it?
> >
> >
>
> Hi Dimitri,
> I have observed similar problems on my opensuse 11.0 box (2.6.25
> based) when doing kernel builds. Oftentimes other apps will get
> starved (like firefox, etc). I was talking to peterz about it and he
> pointed me at a commit that went into 28-rcx that fixes this (the sha
> escapes me at the moment). I've been meaning to pull the patch into our
> 2.6.25 and 2.6.27 trees to verify if it fixes it, but I have fairly high
> confidence it will. Perhaps Peter can point you at the same patch and
> you can give it a whirl. For the time being, can you see if 28 is fixed?
>
2.6.28 appears to be a little bit worse, actually:
Counts:
0: 50165855
1: 8879
2: 5462
3: 4990
4: 91805284
^C
# uname -r
2.6.28-03114-g3c92ec8
The master thread hangs after printing only one set of counts.
Dimitri Sivanich wrote:
> Hi Greg,
>
> On Mon, Jan 05, 2009 at 01:44:41PM -0500, Gregory Haskins wrote:
>
>> Dimitri Sivanich wrote:
>>
>>> I'm wondering if anyone else has run into this problem, and whether they've
>>> found a resolution for it?
>>>
>>>
>>>
>> Hi Dimitri,
>> I have observed similar problems on my opensuse 11.0 box (2.6.25
>> based) when doing kernel builds. Oftentimes other apps will get
>> starved (like firefox, etc). I was talking to peterz about it and he
>> pointed me at a commit that went into 28-rcx that fixes this (the sha
>> escapes me at the moment). I've been meaning to pull the patch into our
>> 2.6.25 and 2.6.27 trees to verify if it fixes it, but I have fairly high
>> confidence it will. Perhaps Peter can point you at the same patch and
>> you can give it a whirl. For the time being, can you see if 28 is fixed?
>>
>>
>
> 2.6.28 appears to be a little bit worse, actually:
>
Ugg...ok, well thanks for verifying. I will defer to Peter and Ingo on
this one.
Please keep me in the loop on any new findings here, as the problem
(assuming it's the same bug) has been annoying me for months (but
obviously not enough for me to do something about it ;)
-Greg
On Mon, 2009-01-05 at 11:56 -0600, Dimitri Sivanich wrote:
> One place we've found this happens is in update_curr(), which calculates a
> delta_exec value as follows:
> delta_exec = (unsigned long)(now - curr->exec_start);
>
> Sometimes this value will be very large: when 'now' (the rq clock time)
> is less than 'exec_start', the unsigned subtraction wraps around to a
> huge positive number. When this happens, __update_curr() will calculate
> a delta_exec_weighted based on this huge value and add it to the
> thread's vruntime:
> curr->vruntime += delta_exec_weighted;
So you're saying your rq->clock = sched_clock_cpu(cpu) = sched_clock()
[on ia64] goes backwards?
If so, then that's an architecture bug, sched_clock() must never be seen
to go backwards!
On Mon, Jan 05, 2009 at 09:55:58PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-05 at 11:56 -0600, Dimitri Sivanich wrote:
> > One place we've found this happens is in update_curr(), which calculates a
> > delta_exec value as follows:
> > delta_exec = (unsigned long)(now - curr->exec_start);
> >
> > Sometimes this value will be very large: when 'now' (the rq clock time)
> > is less than 'exec_start', the unsigned subtraction wraps around to a
> > huge positive number. When this happens, __update_curr() will calculate
> > a delta_exec_weighted based on this huge value and add it to the
> > thread's vruntime:
> > curr->vruntime += delta_exec_weighted;
>
> So you're saying your rq->clock = sched_clock_cpu(cpu) = sched_clock()
> [on ia64] goes backwards?
>
> If so, then that's an architecture bug, sched_clock() must never be seen
> to go backwards!
Actually, sched_clock() should not go backwards on any one cpu, but the readings will be different between cpus.
Also, we noticed the following code is being used for sched_clock_cpu():
u64 sched_clock_cpu(int cpu)
{
    if (unlikely(!sched_clock_running))
        return 0;
    return sched_clock();
}
and that it is called even when smp_processor_id() != cpu.
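The property update_curr() relies on is per-cpu monotonicity. A
conceptual sketch of what sched_clock_cpu() would have to guarantee
(not the kernel's actual implementation, and ignoring locking):

static DEFINE_PER_CPU(u64, last_clock);

u64 sched_clock_cpu_sketch(int cpu)
{
    u64 now = sched_clock();        /* raw, cpu-local timestamp */
    u64 *last = &per_cpu(last_clock, cpu);

    /* Never report a value behind the last one handed out for this
     * cpu, even if the underlying counters have drifted apart. */
    if ((s64)(now - *last) < 0)
        now = *last;
    *last = now;
    return now;
}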
On Mon, Jan 05, 2009 at 03:54:29PM -0600, Dimitri Sivanich wrote:
> On Mon, Jan 05, 2009 at 09:55:58PM +0100, Peter Zijlstra wrote:
> > On Mon, 2009-01-05 at 11:56 -0600, Dimitri Sivanich wrote:
> > > One place we've found this happens is in update_curr(), which calculates a
> > > delta_exec value as follows:
> > > delta_exec = (unsigned long)(now - curr->exec_start);
> > >
> > > Sometimes this value will be very large: when 'now' (the rq clock time)
> > > is less than 'exec_start', the unsigned subtraction wraps around to a
> > > huge positive number. When this happens, __update_curr() will calculate
> > > a delta_exec_weighted based on this huge value and add it to the
> > > thread's vruntime:
> > > curr->vruntime += delta_exec_weighted;
> >
> > So you're saying your rq->clock = sched_clock_cpu(cpu) = sched_clock()
> > [on ia64] goes backwards?
> >
> > If so, then that's an architecture bug, sched_clock() must never be seen
> > to go backwards!
>
> Actually, sched_clock() should not go backwards on any one cpu, but the readings will be different between cpus.
>
> Also, we noticed the following code is being used for sched_clock_cpu():
> u64 sched_clock_cpu(int cpu)
> {
>     if (unlikely(!sched_clock_running))
>         return 0;
>     return sched_clock();
> }
>
> and that it is called even when smp_processor_id() != cpu.
And sure enough, the rq->clock is sometimes going backwards.
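A check along these lines in update_rq_clock(), a debugging sketch
rather than the exact instrumentation we used, is enough to catch it:

    u64 now = sched_clock_cpu(cpu_of(rq));

    /* Debugging sketch: flag a backwards rq clock before updating it. */
    if (unlikely((s64)(now - rq->clock) < 0))
        printk(KERN_WARNING "rq->clock went backwards by %lld ns\n",
               (long long)(rq->clock - now));
    rq->clock = now;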
The comment for sched_clock() in arch/ia64/kernel/head.S:
* Return a CPU-local timestamp in nano-seconds. This timestamp is
* NOT synchronized across CPUs its return value must never be
* compared against the values returned on another CPU. The usage in
* kernel/sched.c ensures that.
We will try this with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK.
Thanks for the heads up!
Peter,
On Mon, Jan 05, 2009 at 04:36:21PM -0600, Dimitri Sivanich wrote:
> And sure enough, the rq->clock is sometimes going backwards.
>
> The comment for sched_clock() in arch/ia64/kernel/head.S:
> * Return a CPU-local timestamp in nano-seconds. This timestamp is
> * NOT synchronized across CPUs its return value must never be
> * compared against the values returned on another CPU. The usage in
> * kernel/sched.c ensures that.
>
> We will try this with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK.
>
The testcase does indeed run correctly with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK enabled.
On Mon, 2009-01-05 at 17:02 -0600, Dimitri Sivanich wrote:
> Peter,
>
> On Mon, Jan 05, 2009 at 04:36:21PM -0600, Dimitri Sivanich wrote:
> > And sure enough, the rq->clock is sometimes going backwards.
> >
> > The comment for sched_clock() in arch/ia64/kernel/head.S:
> > * Return a CPU-local timestamp in nano-seconds. This timestamp is
> > * NOT synchronized across CPUs its return value must never be
> > * compared against the values returned on another CPU. The usage in
> > * kernel/sched.c ensures that.
> >
> > We will try this with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK.
> >
>
> The testcase does indeed run correctly with CONFIG_HAVE_UNSTABLE_SCHED_CLOCK enabled.
Ok, glad it worked out, thanks!
On Mon, 05 Jan 2009 13:59:32 -0500, Gregory Haskins wrote:
> Dimitri Sivanich wrote:
>> Hi Greg,
>>
>> On Mon, Jan 05, 2009 at 01:44:41PM -0500, Gregory Haskins wrote:
>>
>>> Dimitri Sivanich wrote:
>>>
>>>> I'm wondering if anyone else has run into this problem, and whether
>>>> they've found a resolution for it?
>>>>
>>>>
>>>>
>>> Hi Dimitri,
>>> I have observed similar problems on my opensuse 11.0 box (2.6.25
>>> based) when doing kernel builds. Oftentimes other apps will get
>>> starved (like firefox, etc). I was talking to peterz about it and he
>>> pointed me at a commit that went into 28-rcx that fixes this (the sha
>>> escapes me at the moment). I've been meaning to pull the patch into
>>> our 2.6.25 and 2.6.27 trees to verify if it fixes it, but I have
>>> fairly high confidence it will. Perhaps Peter can point you at the
>>> same patch and you can give it a whirl. For the time being, can you
>>> see if 28 is fixed?
>>>
>>>
>>>
>> 2.6.28 appears to be a little bit worse, actually:
>>
>>
> Ugg...ok, well thanks for verifying. I will defer to Peter and Ingo on
> this one.
>
> Please keep me in the loop on any new findings here, as the problem
> (assuming it's the same bug) has been annoying me for months (but
> obviously not enough for me to do something about it ;)
>
> -Greg
Could this be the same issue as the one we have in this bug?
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/276476
Did anyone have a pointer to the patch?
I'm on .28 and it's much, much better, but still not 100%.